Javier posted a simple shell script to our internal chat a few days ago. Its goal was to pull all the IP ranges for a country from https://ipinfo.io/ in preparation for a footprint (let's use PL as an example). Since this involved pulling multiple web pages, I was interested in what the most efficient approach in the shell would be. Truthfully, the actual problem, pulling data from the site or gathering BGP routes, didn't interest me much; I wanted to look at how to do mass HTTP enumeration most efficiently with curl.
As with all shell scripts, his initial approach had some … problems, and took over 4 hours to pull all the data. We'll leave that version out and use the following one, which Javier, Rogan and I came up with, as the baseline:
seq 1 3 \
  | xargs -I% curl -s "https://ipinfo.io/countries/pl/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | xargs -I% curl -s "https://ipinfo.io/%" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
The script:
- Fetches three pages: https://ipinfo.io/countries/pl/1, https://ipinfo.io/countries/pl/2 & https://ipinfo.io/countries/pl/3
- Then greps out the routing AS numbers e.g. AS5617
- Each of those has its details fetched e.g. https://ipinfo.io/AS5617
- Those are then grepped for CIDR addresses e.g. 178.42.0.0/15.
If you aren't familiar with xargs, it simply lets you execute a command across an input, e.g.:
> ls
foo bar baz
> file *
foo: empty
bar: empty
baz: empty
> ls | xargs file
foo: empty
bar: empty
baz: empty
> ls | xargs echo file
file foo bar baz
> ls | xargs -L1 echo file
file foo
file bar
file baz
The -L1 tells xargs to run the command once for each input line; this will be important to understand later.
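To see how -L interacts with -P, here's a small illustration (my own example, not part of the script above), using echo as a stand-in for curl:

> seq 1 6 | xargs -L2 -P3 echo fetch
fetch 1 2
fetch 3 4
fetch 5 6

Each echo receives two input lines (-L2) and up to three invocations run at once (-P3); because they run concurrently, the output order may differ between runs.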
The Test Environment
I didn't want to keep pulling hundreds of megs of data from ipinfo, so I pulled all the data down to my own machine and served it with npm's http-server. This meant I just had to replace the links with http://127.0.0.1:8080/. It also removed the inconsistencies introduced by moving data over the Internet. All in all, the 5 706 files came to 276M of raw bytes. The vast majority of them were the AS pages, which averaged 49K in size, with the largest being 892K and the smallest 12K.
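For reference, producing and serving a mirror with the layout the local URLs below expect could look something like this (a sketch, not the exact commands used):

mkdir -p mirror && cd mirror
# the three country pages become ./1 ./2 ./3
seq 1 3 | xargs -I% curl -s -o % "https://ipinfo.io/countries/pl/%"
# each AS page becomes ./AS<number>
cat 1 2 3 | grep -oE "AS[0-9]{1,9}" | sort -u \
  | xargs -I% curl -s -o % "https://ipinfo.io/%"
npx http-server -p 8080    # serve the mirror on http://127.0.0.1:8080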
To record the execution time of the script, I used the Unix time utility, and to record the network traffic I used tshark, which also generated per-conversation stats via its "-z conv,ip" summary.
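The capture itself could be driven along these lines (a sketch; the interface, capture file and script names are placeholders, not the exact commands used):

# capture loopback traffic for the duration of a run, then summarise it
tshark -i lo0 -w run.pcap &      # lo0 on macOS, lo on Linux
time ./enum.sh                   # whichever variant is being measured
kill %1
tshark -r run.pcap -q -z conv,ip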
So, our base run, executed on my MBP, looks like this, with the important summary at the end.
seq 1 3  0.00s user 0.00s system 44% cpu 0.010 total
xargs -I% curl -s "http://127.0.0.1:8080/%"  0.02s user 0.02s system 40% cpu 0.108 total
grep --color -oE "AS[0-9]{1,9}"  0.03s user 0.00s system 33% cpu 0.108 total
sort -u  0.01s user 0.01s system 14% cpu 0.113 total
xargs -I% curl -s "http://127.0.0.1:8080/%"  35.07s user 30.07s system 50% cpu 2:07.74 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  16.09s user 0.19s system 12% cpu 2:07.74 total
sort -u  0.15s user 0.06s system 0% cpu 2:07.94 total

2:07.94 total time
94 764 frames
295 926 465 bytes
You can see this took over 2 mins to execute, generated 94k frames, and 282M of data.
Approach 1: Parallelism
Multi-processing a problem is often the easiest way to speed it up, and here it can be done simply by passing the -P switch to xargs. Using an arbitrary value of 20 parallel processes gives the following code:
time seq 1 3 \
  | xargs -P20 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | xargs -P20 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This produces the following stats:
seq 1 3  0.00s user 0.00s system 46% cpu 0.009 total
xargs -P20 -I% curl -s "http://127.0.0.1:8080/%"  0.03s user 0.03s system 83% cpu 0.068 total
grep --color -oE "AS[0-9]{1,9}"  0.03s user 0.00s system 56% cpu 0.068 total
sort -u  0.01s user 0.00s system 22% cpu 0.073 total
xargs -P20 -I% curl -s "http://127.0.0.1:8080/%"  48.46s user 46.47s system 304% cpu 31.160 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  23.50s user 0.58s system 77% cpu 31.163 total
sort -u  0.16s user 0.05s system 0% cpu 31.337 total

31.337 total time
94 777 frames
296 116 785 bytes
Excellent, that cuts the time down to about a quarter of the original. It should have no impact on the number of frames or bytes transferred, but in reality that side of things is slightly less deterministic.
Approach 2: Pipelining
Most HTTP servers support keep-alive and pipelining, where multiple requests are made and served within the same TCP connection. This saves having to set up and tear down a new TCP connection for each request.
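You can watch the reuse happen by giving curl more than one URL in a single invocation (a quick illustration, separate from the script below); the verbose output should show the second request re-using the first connection rather than connecting again:

> curl -sv -o /dev/null -o /dev/null "http://127.0.0.1:8080/1" "http://127.0.0.1:8080/2" 2>&1 | grep -iE 'connected|re-using'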
Rogan figured out a neat approach to pipelining with curl using its --config option; the resulting code looks like this:
time curl -s --config \
  <(for x in \
      $(curl -s --config \
          <(for i in `seq 1 3`
            do
              echo "url=http://127.0.0.1:8080/$i"
            done) \
        | grep -oE "AS[0-9]{1,9}" \
        | sort -u)
    do
      echo "url=http://127.0.0.1:8080/$x"
    done) \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This creates a list of url=… lines and feeds it to curl as a config file via process substitution, <(…).
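To see what curl actually receives as its config, you can run the inner loop on its own:

> for i in $(seq 1 3); do echo "url=http://127.0.0.1:8080/$i"; done
url=http://127.0.0.1:8080/1
url=http://127.0.0.1:8080/2
url=http://127.0.0.1:8080/3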
The full one-liner is a bit less readable, and much of the execution time gets hidden in the first curl process, but let's check how it performs:
curl -s --config  0.69s user 0.60s system 9% cpu 14.336 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  13.97s user 0.10s system 98% cpu 14.339 total
sort -u  0.14s user 0.05s system 1% cpu 14.522 total

14.522 total time
44 867 frames
292 995 529 bytes
Whoa, that's half the time, and it cuts the number of frames by more than half compared to the multi-process version. It's only about 1% smaller when looking at total bytes though. So the extra packets really do add a lot of processing overhead, despite not adding much data.
Approach 3: Hybrid
Next up I wanted to see if we could combine both approaches, which led to some fun scripting. Here's the result:
for x in $(for i in $(seq 1 3)
             do
               echo "http://127.0.0.1:8080/$i"
             done \
             | xargs -L1 -P3 curl -s \
             | grep -oE "AS[0-9]{1,9}" \
             | sort -u)
do
  echo "http://127.0.0.1:8080/$x"
done \
  | xargs -L287 -P20 curl -s \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u
This doesn't use Rogan's nifty --config trick for curl, but rather just passes the URLs as input and controls how many URLs are pipelined per curl invocation with -L. I think it's a little more readable. Given the first loop pulls three pages, it makes sense to fetch those simultaneously, hence -L1 -P3. For the second, I took the 5 703 pages that would need to be fetched and divided them across our 20 processes from earlier, which gave me the 287.
This gives the following results:
for x in ; do; echo "http://127.0.0.1:8080/$x"; done  0.13s user 0.06s system 128% cpu 0.146 total
xargs -L287 -P20 curl -s  0.83s user 1.16s system 13% cpu 14.541 total
grep --color -Eo '[0-9\.]{7,15}\/[0-9]{1,2}'  13.64s user 0.28s system 95% cpu 14.543 total
sort -u  0.13s user 0.05s system 1% cpu 14.704 total

14.704 total time
48 318 frames
293 187 882 bytes
So this ends up with much the same performance as the pipelined version, with a few thousand more frames and a few hundred more kilobytes.
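As an aside, if you'd rather not hard-code the 287, the batch size can be derived from the list itself. A rough sketch (my own variation, not one of the benchmarked versions):

# fetch the AS list, build the URL list, then split it evenly across 20 processes
urls=$(seq 1 3 \
  | xargs -P3 -I% curl -s "http://127.0.0.1:8080/%" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | sed 's|^|http://127.0.0.1:8080/|')
batch=$(( ($(echo "$urls" | wc -l) + 19) / 20 ))   # ceil(count / 20)
echo "$urls" | xargs -L"$batch" -P20 curl -s \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u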
Conclusion
Test      | Base    | Parallel | Pipelined | Hybrid
Time      | 2:07.94 | 31.337 s | 14.522 s  | 14.704 s
Frames    | 94 764  | 94 777   | 44 867    | 48 318
Data (MB) | 282.21  | 282.39   | 279.42    | 279.60
If you need to make lots of small HTTP requests, curl's pipelining gives you the best speed and the least overhead on the wire. If you find yourself needing a quick speed-up for your average shell script, xargs -P is your friend. You could also use GNU parallel (which ended up much slower in these tests), which would additionally let you spread the work across multiple hosts, as sketched below. Hopefully you also picked up some shell script-fu along the way.
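A rough GNU parallel equivalent of the parallel fetch stages would look something like this (a sketch; not the exact invocation that was benchmarked):

seq 1 3 \
  | parallel -j3 curl -s "http://127.0.0.1:8080/{}" \
  | grep -oE "AS[0-9]{1,9}" \
  | sort -u \
  | parallel -j20 curl -s "http://127.0.0.1:8080/{}" \
  | grep -Eo '[0-9\.]{7,15}\/[0-9]{1,2}' \
  | sort -u

parallel's --sshlogin option is what would let the same job list be distributed across several machines.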