Parsing 10TB of Metadata, 26M Domain Names and 1.4M SSL Certs for $10 on AWS
My approach was somewhat different from the GitHub project above: instead of using the AIA extension, I wanted to brute-force the solution by collecting all known intermediate and root certificates in advance. Based on a checksum of the issuer/subject fields I could look up which certificates “claimed” to be the signer of a given certificate, and then use the signature to filter out which ones actually were. You can use my online service like this:
cat CERT.crt | curl -F myfile=@- -X POST https://certinator.cfapps.io/chain/
You then get the entire chain back on your prompt.
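In Go terms the lookup boils down to something like the sketch below. It assumes the pool of known intermediate and root certificates has already been parsed into memory; the SHA-1 index on the raw subject and the depth cap are illustrative choices, not the exact code behind the service.

package chain

// Sketch of the brute-force chain lookup. Candidates are indexed by a SHA-1
// of their raw subject, and the signature check decides which "claimed"
// issuer really signed a certificate.

import (
    "crypto/sha1"
    "crypto/x509"
)

// indexBySubject maps sha1(raw subject) -> certificates with that subject.
func indexBySubject(pool []*x509.Certificate) map[[20]byte][]*x509.Certificate {
    m := make(map[[20]byte][]*x509.Certificate)
    for _, c := range pool {
        k := sha1.Sum(c.RawSubject)
        m[k] = append(m[k], c)
    }
    return m
}

// chain walks from cert towards a self-signed root, keeping only the
// candidate issuers whose public key actually verifies the signature.
func chain(cert *x509.Certificate, bySubject map[[20]byte][]*x509.Certificate) []*x509.Certificate {
    out := []*x509.Certificate{cert}
    for depth := 0; depth < 10; depth++ { // cap the walk to avoid cross-signing loops
        var issuer *x509.Certificate
        for _, ca := range bySubject[sha1.Sum(cert.RawIssuer)] {
            if ca.Equal(cert) {
                return out // reached a self-signed root
            }
            if cert.CheckSignatureFrom(ca) == nil { // claimed *and* proven signer
                issuer = ca
                break
            }
        }
        if issuer == nil {
            return out // issuer not in the pool; chain stays incomplete
        }
        out = append(out, issuer)
        cert = issuer
    }
    return out
}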
Brute-forcing always leads to interesting optimization challenges. In this case I needed to scrape as many certificates from the SNI-enabled web as I could, and for that I needed the longest list of domain names available on the Internet. There is the Alexa top 1 million, but I thought something better should exist. I could not find anything as a direct download, but I did stumble upon the Common Crawl project. They crawl a huge number of URLs and publish the results on S3. Luckily they split the metadata (headers etc.) from the actual data, and the URL is listed in the metadata. That can be filtered down to a list of all URLs, which can in turn be filtered down to a list of all domain names. A promising approach, since I don’t need the entire data set; the metadata alone will suffice.
Still, the metadata of one scan is about 10TB. Downloading that at home over a 60Mbit connection would take about 15 days and would probably violate my ISP’s Fair Use Policy.
However, the data is on S3, and AWS offers the c3.8xlarge: a 32-core instance with 10Gbit networking and free data transfer from S3. If the network is the bottleneck, that gives a theoretical processing time of just over 2 hours (10TB is 80 terabits, which at 10Gbit/s comes down to roughly 8,000 seconds), about a 100x speedup over my home connection! The machine costs about $1.70 per hour, and I would not need Hadoop/MapReduce or any kind of clustering to get results.
Now, I’m a big fan of using simple Unix commands to get complex tasks done. Using the “split” command I split the list of metadata files into chunks of about 100 files each. Each line of these chunks was then processed with a pipeline like this:
curl -s http://aws-publicdatasets.s3.amazonaws.com/$i | zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq
This filters a 150MB gzipped file down to about 100KB of raw domain names. I used a bit of Python with gevent for job scheduling, but looking back, “xargs -n 1 -P 50” would have done just as well. Even though the chunk results were only 100KB each, after processing thousands of them the 8GB root partition was filling up, so I started “rsync --remove-source-files” in a loop to transfer the result files to my laptop. I could have stored the results on an EBS volume, but then, after all the hard work was done, I would have had to wait for the results (10GB) to download to my laptop. Better to download them incrementally and throw away the $1.70/hour machine the minute I was done with it. Not that an extra $1.70 would break the bank, but I challenged myself to keep efficiency at a maximum.
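The scheduling itself is small enough to sketch. Below is a version in Go rather than the Python/gevent I actually used: a fixed pool of workers reads metadata paths from stdin and runs the pipeline above for each one. The worker count of 30 matches the sweet spot described below; the schedule.go name and the results/ output naming are made up for the example.

package main

// Sketch of the job scheduler, in Go instead of the Python/gevent actually
// used. Usage would be something like: cat metadata-paths | go run schedule.go
// The results/ directory and file naming are assumptions.

import (
    "bufio"
    "fmt"
    "os"
    "os/exec"
    "sync"
    "sync/atomic"
)

const workers = 30 // roughly where 32 cores hit ~95% CPU

// The same pipeline as above, with the metadata path substituted in.
const pipeline = "curl -s http://aws-publicdatasets.s3.amazonaws.com/%s | " +
    "zgrep '^WARC-Target-URI:' | cut -d'/' -f 3 | cut -d'?' -f 1 | sort | uniq"

func main() {
    os.MkdirAll("results", 0755)
    jobs := make(chan string)
    var counter int64
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for path := range jobs {
                out, err := exec.Command("sh", "-c", fmt.Sprintf(pipeline, path)).Output()
                if err != nil {
                    fmt.Fprintln(os.Stderr, path, err)
                    continue
                }
                name := fmt.Sprintf("results/%d.txt", atomic.AddInt64(&counter, 1))
                os.WriteFile(name, out, 0644)
            }
        }()
    }
    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        jobs <- in.Text()
    }
    close(jobs)
    wg.Wait()
}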
I used “nmon” to determine the optimal amount of parallelization, which turned out to be about 30 simultaneous downloads. The bottleneck turned out to be CPU rather than network, by a small margin. Unpacking the .gz files is expensive! The 30 gzip commands were at the top in “top”. At 30 simultaneous downloads I was doing 95% CPU on 32 cores and about 5Gbit/s of network traffic. Initially I launched the instance in the wrong region, so latency to S3 was pretty high; that had a pretty big influence on performance, but I didn’t write down the exact numbers. For the record, the data set is hosted in us-east-1.
At my day job I regularly work with about 30 physical machines with 288GB RAM and 16 cores each, but I think for those 4 hours in May this “little” VM did more than all that infrastructure combined.
To merge the chunk results into one deduplicated list of domain names, ideally a simple
cat segments/* | sort | uniq
would be preferable. The problem was that the original data set, with all duplicates still in it, did not fit in my laptop’s memory, and “sort” needs to keep the whole thing in memory, so I solved it with a bit of Python and a marisa_trie that removes the duplicates. Looking back, maybe “sort -u” has the same optimization? Anyway, I also learned that setting “LC_ALL=C” significantly speeds up “sort” (about 4x in my case) and can be used if you’re not dealing with special characters.
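The same streaming dedup can also be sketched in Go with a plain map instead of the marisa_trie; it is less memory-efficient than the trie, but 26 million unique names should still fit comfortably in a laptop’s RAM. The dedup.go name is made up:

package main

// Streaming dedup sketch: print each domain name only the first time it is
// seen. A plain map is used here instead of the marisa_trie from the post,
// which trades memory for simplicity.

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    seen := make(map[string]struct{})
    in := bufio.NewScanner(os.Stdin)
    out := bufio.NewWriter(os.Stdout)
    defer out.Flush()
    for in.Scan() {
        d := in.Text()
        if _, ok := seen[d]; ok {
            continue
        }
        seen[d] = struct{}{}
        fmt.Fprintln(out, d)
    }
}

Run as something like “cat segments/* | go run dedup.go > domainnames”.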
So now I had 26 million domain names. How do we get the X509 certificates from those servers? I created a small program in Go that does just that:
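The core of it looks roughly like this (a simplified sketch rather than the exact program; the 10-second timeout, the certs/ output directory and the serial loop are illustrative choices):

package main

// Simplified sketch of the certificate fetcher: read domain names from
// stdin, do an SNI-enabled TLS handshake, and store every certificate in the
// presented chain under a filename that is its SHA-1 sum. The timeout and
// the certs/ output directory are illustrative, not the original settings.

import (
    "bufio"
    "crypto/sha1"
    "crypto/tls"
    "encoding/hex"
    "fmt"
    "net"
    "os"
    "path/filepath"
    "time"
)

func fetch(domain string) error {
    dialer := &net.Dialer{Timeout: 10 * time.Second}
    conn, err := tls.DialWithDialer(dialer, "tcp", domain+":443", &tls.Config{
        ServerName:         domain, // SNI
        InsecureSkipVerify: true,   // we want the raw certificates, valid or not
    })
    if err != nil {
        return err
    }
    defer conn.Close()

    for _, cert := range conn.ConnectionState().PeerCertificates {
        sum := sha1.Sum(cert.Raw)
        name := filepath.Join("certs", hex.EncodeToString(sum[:])+".der")
        if err := os.WriteFile(name, cert.Raw, 0644); err != nil {
            return err
        }
    }
    return nil
}

func main() {
    os.MkdirAll("certs", 0755)
    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        if err := fetch(in.Text()); err != nil {
            fmt.Fprintln(os.Stderr, in.Text(), err)
        }
    }
}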
I executed it with
cat domainnames | go run fetch.go
This ran for a couple of hours on my laptop with quite low CPU and network usage. It does nothing but an SNI-enabled SSL/TLS handshake and stores every certificate under a filename that is its SHA-1 sum.
Now I had about 1.4 million X509 certificates.
Using the output of
tar -zxOf certificates.tar.gz | go run print-certs.go > hashes
I was able to find some nice things (a sketch of such a print-certs-style tool follows the list):
- People reusing the same public/private key across multiple certificates. Some guy in Russia did this 174 times.
sort hashes | uniq -w 30 -c -d | sort -n | tail
- Top 3 CAs by certificate count: 1. Go Daddy Secure Certificate Authority - G2, 2. COMODO RSA Domain Validation Secure Server CA, 3. RapidSSL CA.
cat hashes | cut -f 4 -d ' ' | sort | uniq -c | sort -n | tail
- 59693 self-signed certificates with subject
C=US, ST=Virginia, L=Herndon, O=Parallels, OU=Parallels Panel, CN=Parallels Panel/emailAddress=info@parallels.com
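As promised above, here is a sketch of what a print-certs-style tool might look like. The real program’s output columns aren’t shown in this post, so the layout below is one plausible choice: the SHA-1 of the public key first (which is what the uniq -w trick keys on), then the SHA-1 of the certificate, the subject CN and the issuer CN. It assumes the tarball contains raw DER certificates:

package main

// Sketch of a print-certs-style tool: read concatenated DER certificates on
// stdin (as produced by "tar -zxOf certificates.tar.gz") and print one line
// per certificate. The column layout here is an assumption: sha1(public key),
// sha1(certificate), subject CN, issuer CN.

import (
    "crypto/sha1"
    "crypto/x509"
    "encoding/hex"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    der, err := io.ReadAll(os.Stdin)
    if err != nil {
        log.Fatal(err)
    }
    certs, err := x509.ParseCertificates(der)
    if err != nil {
        log.Fatal(err) // a more robust version would skip malformed certificates
    }
    for _, c := range certs {
        pubSum := sha1.Sum(c.RawSubjectPublicKeyInfo) // same key -> same first column
        certSum := sha1.Sum(c.Raw)
        fmt.Printf("%s %s %q %q\n",
            hex.EncodeToString(pubSum[:]),
            hex.EncodeToString(certSum[:]),
            c.Subject.CommonName,
            c.Issuer.CommonName)
    }
}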
If you have any questions or ideas for improvement, I’d like to hear them in the comments on Hacker News!