Javier posted a simple shell script to our internal chat a few days ago. Its goal was to pull all the IP ranges for a country from https://ipinfo.io/ in preparation for a footprint (let's use PL as an example). Since this involved pulling multiple web pages, I was interested in what the most efficient approach in the shell would be. Truthfully, the actual problem, pulling data from the site or gathering BGP routes, didn't interest me much; I wanted to look at how to do mass HTTP enumeration most efficiently with curl.
As with all shell scripts, his initial approach had some … problems, and took over 4 hours to pull all the data. We'll leave that version out and use the following one, which Javier, Rogan and I came up with, as the baseline:
seq 1 3 \
  | xargs -I% curl -s "https://ipinfo.io/countries/pl/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | xargs -I% curl -s "https://ipinfo.io/%" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
The script:
- Fetches three pages: https://ipinfo.io/countries/pl/1, https://ipinfo.io/countries/pl/2 & https://ipinfo.io/countries/pl/3
- Then greps out the routing AS numbers e.g. AS5617
- Each of those has its details fetched e.g. https://ipinfo.io/AS5617
- Those are then grepped for CIDR addresses e.g. 178.42.0.0/15.
If you aren't familiar with xargs, it simply lets you execute a command across an input, e.g.:
> ls
foo bar baz
> file *
foo: empty
bar: empty
baz: empty
> ls | xargs file
foo: empty
bar: empty
baz: empty
> ls | xargs echo file
file foo bar baz
> ls | xargs -L1 echo file
file foo
file bar
file baz
The -L1 tells xargs to run the command once for each input line; this will be important to understand later.
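To see how -L interacts with -P, here's a small illustration (my own example, not part of the script above), using echo as a stand-in for curl:

> seq 1 6 | xargs -L2 -P3 echo fetch
fetch 1 2
fetch 3 4
fetch 5 6

Each echo receives two input lines (-L2) and up to three invocations run at once (-P3); because they run concurrently, the output order may differ between runs.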
The Test Environment
I didn't want to keep pulling hundreds of megs of data from ipinfo, so I pulled all the data down to my own machine and served it with npm's http-server. This meant I just had to replace the links with http://127.0.0.1:8080/. It also removed the inconsistencies introduced by moving data over the Internet. All in all, the 5 706 files came to 276M of raw bytes. The vast majority of them were the AS pages, which averaged 49K in size, with the largest being 892K and the smallest 12K.
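For reference, producing and serving a mirror with the layout the local URLs below expect could look something like this (a sketch, not the exact commands used):

mkdir -p mirror && cd mirror
# the three country pages become ./1 ./2 ./3
seq 1 3 | xargs -I% curl -s -o % "https://ipinfo.io/countries/pl/%"
# each AS page becomes ./AS<number>
cat 1 2 3 | grep -oE "AS[0-9]{1,9}" | sort -u \
  | xargs -I% curl -s -o % "https://ipinfo.io/%"
npx http-server -p 8080    # serve the mirror on http://127.0.0.1:8080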
To record the execution time of the script, I used the Unix time utility, and to record the network traffic I used tshark, which also generated per-conversation stats via its "-z conv,ip" summary.
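The capture itself could be driven along these lines (a sketch; the interface, capture file and script names are placeholders, not the exact commands used):

# capture loopback traffic for the duration of a run, then summarise it
tshark -i lo0 -w run.pcap &      # lo0 on macOS, lo on Linux
time ./enum.sh                   # whichever variant is being measured
kill %1
tshark -r run.pcap -q -z conv,ip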
So, our base run, executed on my MBP, looks like this, with the important summary at the end.
seq 1 3  0.00s user 0.00s system 44% cpu 0.010 total
xargs -I% curl -s "http://127.0.0.1:8080/%"  0.02s user 0.02s system 40% cpu 0.108 total
grep --color -oE "AS[0-9]{1,9}"  0.03s user 0.00s system 33% cpu 0.108 total
sort -u  0.01s user 0.01s system 14% cpu 0.113 total
xargs -I% curl -s "http://127.0.0.1:8080/%"  35.07s user 30.07s system 50% cpu 2:07.74 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  16.09s user 0.19s system 12% cpu 2:07.74 total
sort -u  0.15s user 0.06s system 0% cpu 2:07.94 total

2:07.94 total time
94 764 frames
295 926 465 bytes
You can see this took over 2 mins to execute, generated 94k frames, and 282M of data.
Approach 1: Parallelism
Multi-processing a problem is often the easiest way to speed it up, and here it can be done simply by passing the -P switch to xargs. Using an arbitrary value of 20 parallel processes gives the following code:
time seq 1 3 \
  | xargs -P20 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | xargs -P20 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This produces the following stats:
seq 1 3  0.00s user 0.00s system 46% cpu 0.009 total
xargs -P20 -I% curl -s "http://127.0.0.1:8080/%"  0.03s user 0.03s system 83% cpu 0.068 total
grep --color -oE "AS[0-9]{1,9}"  0.03s user 0.00s system 56% cpu 0.068 total
sort -u  0.01s user 0.00s system 22% cpu 0.073 total
xargs -P20 -I% curl -s "http://127.0.0.1:8080/%"  48.46s user 46.47s system 304% cpu 31.160 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  23.50s user 0.58s system 77% cpu 31.163 total
sort -u  0.16s user 0.05s system 0% cpu 31.337 total

31.337 total time
94 777 frames
296 116 785 bytes
Excellent, that cuts the time down to about a quarter of the original. It should have no impact on the number of frames or bytes transferred, but in reality that side of things is slightly less deterministic.
Approach 2: Pipelining
Most HTTP servers support keep-alive and pipelining, where multiple requests are made and served within the same TCP connection. This saves having to set up and tear down a new TCP connection for each request.
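You can watch the reuse happen by giving curl more than one URL in a single invocation (a quick illustration, separate from the script below); the verbose output should show the second request re-using the first connection rather than connecting again:

> curl -sv -o /dev/null -o /dev/null "http://127.0.0.1:8080/1" "http://127.0.0.1:8080/2" 2>&1 | grep -iE 'connected|re-using'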
Rogan figured out a neat approach to pipelining with curl using its --config option; the resulting code looks like this:
time curl -s --config \
  <(for x in \
      $(curl -s --config \
          <(for i in `seq 1 3`
            do
              echo "url=http://127.0.0.1:8080/$i"
            done) \
        | grep -oE "AS[0-9]{1,9}" \
        | sort -u)
    do
      echo "url=http://127.0.0.1:8080/$x"
    done) \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This creates a list of url=… lines and feeds it to curl as a config file via process substitution, <(…).
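To see what curl actually receives as its config, you can run the inner loop on its own:

> for i in $(seq 1 3); do echo "url=http://127.0.0.1:8080/$i"; done
url=http://127.0.0.1:8080/1
url=http://127.0.0.1:8080/2
url=http://127.0.0.1:8080/3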
The full one-liner is a bit less readable, and much of the execution time gets hidden in the first curl process, but let's check how it performs:
curl -s --config  0.69s user 0.60s system 9% cpu 14.336 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  13.97s user 0.10s system 98% cpu 14.339 total
sort -u  0.14s user 0.05s system 1% cpu 14.522 total

14.522 total time
44 867 frames
292 995 529 bytes
Whoa, that's half the time, and it cuts the number of frames by more than half compared to the multi-process version. It's only about 1% smaller when looking at total bytes though. So the extra packets really do add a lot of processing overhead, despite not adding much data.
Approach 3: Hybrid
Next up I wanted to see if we could combine both approaches, which led to some fun scripting. Here's the result:
for x in $(for i in $(seq 1 3)
             do
               echo "http://127.0.0.1:8080/$i"
             done \
             | xargs -L1 -P3 curl -s \
             | grep -oE "AS[0-9]{1,9}" \
             | sort -u)
do
  echo "http://127.0.0.1:8080/$x"
done \
  | xargs -L287 -P20 curl -s \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This doesn't use Rogan's nifty --config trick for curl, but rather just passes the URLs as input and controls how many URLs are pipelined per curl invocation with -L. I think it's a little more readable. Given the first loop pulls three pages, it makes sense to fetch those simultaneously, hence -L1 -P3. For the second, I took the 5 703 pages that would need to be fetched and divided them across our 20 processes from earlier, which gave me the 287.
This gives the following results:
for x in ; do; echo "http://127.0.0.1:8080/$x"; done  0.13s user 0.06s system 128% cpu 0.146 total
xargs -L287 -P20 curl -s  0.83s user 1.16s system 13% cpu 14.541 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  13.64s user 0.28s system 95% cpu 14.543 total
sort -u  0.13s user 0.05s system 1% cpu 14.704 total

14.704 total time
48 318 frames
293 187 882 bytes
So this ends up with much the same performance as the pipelined version, with a few thousand more frames and a few hundred more kilobytes.
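As an aside, if you'd rather not hard-code the 287, the batch size can be derived from the list itself. A rough sketch (my own variation, not one of the benchmarked versions):

# fetch the AS list, build the URL list, then split it evenly across 20 processes
urls=$(seq 1 3 \
  | xargs -P3 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | sed 's|^|http://127.0.0.1:8080/|')
batch=$(( ($(echo "$urls" | wc -l) + 19) / 20 ))   # ceil(count / 20)
echo "$urls" | xargs -L"$batch" -P20 curl -s \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u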
Conclusion
Test      | Base    | Parallel | Pipelined | Hybrid
Time      | 2:07.94 | 31.337 s | 14.522 s  | 14.704 s
Frames    | 94 764  | 94 777   | 44 867    | 48 318
Data (MB) | 282.21  | 282.39   | 279.42    | 279.60
If you need to make lots of small HTTP requests, curl's pipelining gives you the best speed and the least overhead on the wire. If you find yourself needing a quick speed-up for your average shell script, xargs -P is your friend. You could also use GNU parallel (which ended up much slower in these tests), which would additionally let you spread the work across multiple hosts, as sketched below. Hopefully you also picked up some shell script-fu along the way.
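A rough GNU parallel equivalent of the parallel fetch stages would look something like this (a sketch; not the exact invocation that was benchmarked):

seq 1 3 \
  | parallel -j3 curl -s "http://127.0.0.1:8080/{}" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | parallel -j20 curl -s "http://127.0.0.1:8080/{}" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u

parallel's --sshlogin option is what would let the same job list be distributed across several machines.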