
The training will be offered in Brooklyn Pretoria from 14 - 17 September 2010. Here's how it will work:
For more information visit our Extended Edition page, or drop us a note via training[at]sensepost[dot]com.
Last week we presented an invited talk at the ISSA conference on the topic of online privacy (embedded below, click through to SlideShare for the original PDF.)
The talk is an introductory overview of Privacy from a Security perspective and was prompted by discussions between security & privacy people along the line of "Isn't Privacy just directed Security? Privacy is to private info what PCI is to card info?" It was further prompted by discussion with Joe the Plumber along the lines of "Privacy is dead!"
The talk, is unfortunately best delivered as a talk, and not as standalone slides, so here's some commentary:
We start off the problem statement describing why privacy has grown in importance. The initial reactions were based on new technology allowing new types of information to be captured and disseminated. While the example given is from the 1980s, the reaction is a recurring one, as we've seen with each release of new tech (some examples: Cameras, Newspapers, Credit Cards, The Internet, Facebook). Reactions are worsened by the existence of actors with the funding & gall to collect and collate much information to further potentially disagreeable goals (usually Governments). However, the new threat is that there has been a fundamental shift in the way in which we live our lives, where information about us is no longer merely *recorded* online, but rather, our lives are *lived* on line. It is quite possible that for an average day, from waking up to going to sleep, a significant number of the actions you perform will not only be conducted (in part) online, but that it is possible for them to be conducted using the services of one service provider. My intention is not to beat up on Google, but rather use them as an example. They are a pertinent example, as every business book seems to use them as one. The, arguably, most successful corporation of our current age's primary business model is the collection & monetisation of private data. Thus, while Google is the example, there are and will be many followers.
The next section moves into providing a definition of privacy, and attempts to fly through some fairly dry aspects of philosophy, law & psychology. We've done some entry-level work on collating the conception of privacy across history and these fields, however, brighter minds, such as Daniel Solove and Kamil Reddy have done better jobs of this. In particular, Solove's paper "I've got nothing to hide", and other misconception of privacy is a good introductory read. The key derived point however, is that private data is data with an implied access control & authorised use. Which of the implied access controls & authorised uses are reasonable to enforce or can be legally enforced is a developing field.
As the talk is about "Online Privacy" the talk moves into a description of the various levels at which private data is collected, what mechanisms are used to attempt to collect that data, and what sort of data can be gleaned. It was an academic conference, so I threw in the word "taxonomy." Soon, it will be more frequently quoted than Maslow's Hierarchy, any day now.
At each level, a brief demonstration of non-obvious leaks and their implications was demonstrated. From simple techniques such as cross-site tracking using tracking pixels or cookies, to exploit of rich browser environments such as the simple CSS history hack, to less structured and less obvious leaks such as search data (as demonstrated by the AOL leak), moving to deanonymisation of an individual by correlating public data sets (using the awesome Maltego) and finally to unintended leaks provided by meta-data (through analysis of twitter & facebook friends groups).
Finally, a mere two slides are used to explain some of the implications and defenses. These are incomplete and are the current area of research I'm engaged in.
[Update: Disclosure and other points discussed in a little more detail here.]
We choose to look at memcached, a "Free & open source, high-performance, distributed memory object caching system" 1. It's not outwardly sexy from a security standpoint and it doesn't have a large and exposed codebase (total LOC is a smidge over 11k). However, what's of interest is the type of applications in which memcached is deployed. Memcached is most often used in web application to speed up page loads. Sites are almost2 always dynamic and either have many clients (i.e. require horizontal scaling) or process piles of data (look to reduce processing time), or oftentimes both. This implies that the sites that use memcached contain more interesting info than simple static sites, and are an indicator of a potentially interesting site. Prominent users of memcached include LiveJournal (memcached was originally written by Brad Fitzpatrick for LJ), Wikipedia, Flickr, YouTube and Twitter.
I won't go into how memcached works, suffice it to say that since data tends to be read more often than written in common use cases the idea is to pre-render and store the finalised content inside the in-memory cache. When future requests ask for the page or data, it doesn't need to be regenerated but can be simply regurgitated from the cache. Their Wiki contains more background.
insurrection:demo marco$ ruby go-derper.rb -f x.x.x.x
[i] Scanning x.x.x.x
x.x.x.x:11211
==============================
memcached 1.4.5 (1064) up 54:10:01:27, sys time Wed Aug 04 10:34:36 +0200 2010, utime=369388.17, stime=520925.98
Mem: Max 1024.00 MB, max item size = 1024.00 KB
Network: curr conn 18, bytes read 44.69 TB, bytes written 695.93 GB
Cache: get 514, set 93.41b, bytes stored 825.73 MB, curr item count 1.54m, total items 1.54m, total slabs 3
Stats capabilities: (stat) slabs settings items (set) (get)
44 terabytes read from the cache in 54 days with 1.5 million items stored? This cache is used quite frequently. There's an anomaly here in that the cache reports only 514 reads with 93 billion writes; however it's still worth exploring if only for the size.
We can run the same fingerprint scan against multiple hosts using
ruby go-derper.rb -f host1,host2,host3,...,hostn
or, if the hosts are in a file (one per line):
ruby go-derper.rb -F file_with_target_hosts
Output is either human-readable multiline (the default), or CSV. The latter helps for quickly rearranging and sorting the output to determine potential targets, and is enabled with the "-c" switch:
ruby go-derper.rb -c csv -f host1,host2,host3,...,hostn
Lastly, the monitor mode (-m) will loop forever while retrieving certain statistics and keep track of differences between iterations, in order to determine whether the cache appears to be in active use.
insurrection:demo marco$ ruby go-derper.rb -l -s x.x.x.x
[w] No output directory specified, defaulting to ./output
[w] No prefix supplied, using "run1"
This will extract data from the cache in the form of a key and its value, and save the value in a file under the "./output" directory by default (if this directory doesn't exist then the tool will exit so make sure it's present.) This means a separate file is created for every retrieved value. Output directories and file prefixes are adjustable with "-o" and "-r" respectively, however it's usually safe to leave these alone.
By default, go-derper fetches 10 keys per slab (see the memcached docs for a discussion on slabs; basically similar-sized entries are grouped together.) This default is intentionally low; on an actual assessment this could run into six figures. Use the "-K" switch to adjust:
ruby go-derper.rb -l -K 100 -s x.x.x.x
As mentioned, retrieved data is stored in the "./ouput" directory (or elsewhere if "-o" is used). Within this directory, each new run of the tool produces a set of files prefixed with "runN" in order to keep multiple runs separate. The files produced are:
At this point, there will (hopefully) be a large number of files in your output directory, which may contain useful info. Start grepping.
What we found with a bit of field experience was that mining large caches can take some time, and repeating grep gets quite boring. The tool permits you to supply your own set of regular expressions which will be applied to each retrieved value; matches are printed to the screen and this provides a scroll-by view of bits of data that may pique your interest (things like URLs, email addresses, session IDs, strings starting with "user", "pass" or "auth", cookies, IP addresses etc). The "-R" switch enables this feature and takes a file containing regexes as its sole argument:
ruby go-derper.rb -l -K 100 -R regexs.txt -s x.x.x.x
ruby go-derper.rb -w output/run1-e94aae85bd3469d929727bee5009dddd
This syntax is simple since go-derper will figure out the target server and key from the run's index file.
2 We're hedging here, but we've not come across a static memcached site.
3 If so, you may be as surprised as we were in finding this many open instances.
Wow. At some point our talk hit HackerNews and then SlashDot after swirling around the Twitters for a few days. The attention is quite astounding given the relative lack of technical sexiness to this; explanations for the interest are welcome!
We wanted to highlight a few points that didn't make the slides but were mentioned in the talk:
The potential risk assigned to exposed memcacheds hasn't as yet been publicly demonstrated so it's unsurprising that you'll find memcacheds around. I imagine this issue will flare and be hunted into extinction, at least on the public interwebs.
Lastly, the major interest seems to be on mining data from exposed caches. An equally disturbing issue is overwriting entries in the cache and this shouldn't be underestimated.