Slashdot picked up on the blog post from Light Blue TouchPaper commenting on the fact that a researcher was surprised to discover that simply putting an md5 hash into Google returned a hit mapping it back to the original word.
This is an interesting concept.. A while back, we decided to fiddle with the idea of using Google's indexing and spidering as a new take on the time/space trade-off for password cracking.
What we did:
A simple CGI script that accepts a single parameter. We then use URL rewriting to make the script look less scripty and more crawler friendly.
A quick check on the internet shows that Google only indexes about the first 100k of a document. Our CGI sits around doing nothing until it's first visited.
Once it is, it generates every string from a..ZZZZZ and prints each one along with its md5 hash.
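A rough sketch of that generation step, in Python. This is not the original CGI; the alphabet ordering (lowercase, then uppercase, shortest strings first) and the names `candidates` and `hash_line` are assumptions made for illustration.

```python
import hashlib
import string
from itertools import islice, product

# Assumed alphabet and ordering: a..z then A..Z, shortest strings first.
ALPHABET = string.ascii_lowercase + string.ascii_uppercase
MAX_LEN = 5  # a..ZZZZZ means strings of 1 to 5 characters


def candidates(max_len=MAX_LEN):
    """Yield every string over ALPHABET with length 1..max_len."""
    for length in range(1, max_len + 1):
        for combo in product(ALPHABET, repeat=length):
            yield "".join(combo)


def hash_line(word):
    """Return 'word md5hex', the pair we want the crawler to index."""
    return f"{word} {hashlib.md5(word.encode('ascii')).hexdigest()}"


# Only peek at the first few entries; the full space is far too big to print.
first_three = [hash_line(w) for w in islice(candidates(), 3)]
```

The generator is lazy on purpose: the full a..ZZZZZ space is a few hundred million strings, so each page should only ever pull the slice it is going to print.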
So if you hit https://secure.sensepost.com/sp-hash/a, you would get the list of strings and hashes starting from a.
Now since Google only indexes up to a certain point in the doc, it's useless filling this page with all of the hashes, so at 100k we stop. If the string at that point is abc, the CGI then creates a link to itself with abc as the param (in our picture it stops at pnt).
The crawler hits that link, effectively hitting and seeding the same CGI, which then keeps going ad infinitum..
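The stop-and-link trick could look something like the sketch below. Again this is a guess at the shape, not our actual script: `successor`, `render_page`, the HTML layout and the exact byte budget are all assumptions, and only the /sp-hash/ path comes from the example URL.

```python
import hashlib
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # assumed ordering
MAX_LEN = 5
PAGE_BUDGET = 100_000  # the "100k" cutoff; the real crawler limit is an assumption
BASE_PATH = "/sp-hash/"  # path taken from the example URL


def successor(word):
    """Return the string after `word`, or None once ZZZZZ has been passed."""
    chars = list(word)
    for i in range(len(chars) - 1, -1, -1):
        idx = ALPHABET.index(chars[i])  # raises ValueError on unexpected input
        if idx + 1 < len(ALPHABET):
            chars[i] = ALPHABET[idx + 1]
            return "".join(chars)
        chars[i] = ALPHABET[0]  # this position rolled over, carry to the left
    # Every position rolled over: move on to the next length, if any.
    if len(word) < MAX_LEN:
        return ALPHABET[0] * (len(word) + 1)
    return None


def render_page(start, budget=PAGE_BUDGET):
    """Emit word/hash lines from `start` until the budget is used up.

    Returns (html, next_word). next_word is the seed for the follow-up
    page, or None when the keyspace is exhausted.
    """
    parts = []
    size = 0
    word = start
    while word is not None:
        digest = hashlib.md5(word.encode("ascii")).hexdigest()
        line = f"{word} {digest}<br>\n"
        if size + len(line) > budget:
            break  # stop before the crawler would stop reading
        parts.append(line)
        size += len(line)
        word = successor(word)
    if word is not None:
        # Link back to ourselves so the crawler seeds the next page.
        parts.append(f'<a href="{BASE_PATH}{word}">{word}</a>\n')
    return "".join(parts), word
```

Calling `render_page("a")` gives the first page plus the seed for the next one, and a crawler following the link just repeats the call with that seed. A real CGI would also want to reject parameters containing anything outside the alphabet before handing them to `successor`.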
This can be tested: a quick Google for site:secure.sensepost.com + adog returns a hit on our generated page.
(You can also use Google Webmaster Tools to pre-seed the spider.)
Unfortunately I never got back to it, but I noticed that while Google did index the full charset a..zzzzz, at some point some hits disappeared.. I'm not sure if this is due to filtering on some of the words that emerged or simply not enough link credibility..
I suspect that if the problem is the latter, it could be fixed by more people picking up seeds. In this plan, multiple people would run the CGI and a type of delegation can be set up, so while Google is indexing me from a..zzz it's indexing someone else from zzz..ZZZ etc. At just the cost of bandwidth, this would give useful results..
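One way the delegation could be worked out is to number every string in the keyspace and hand each participant an equal slice. A minimal sketch, assuming the same lowercase-then-uppercase, shortest-first ordering as before; `index_to_word` and `seed_words` are made-up helper names, and we never actually built this part.

```python
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase  # assumed ordering
MAX_LEN = 5
BASE = len(ALPHABET)
# Number of strings of length 1..MAX_LEN over the alphabet.
TOTAL = sum(BASE ** k for k in range(1, MAX_LEN + 1))


def index_to_word(index):
    """Map a position in the enumeration to its string (0 -> 'a', 52 -> 'aa')."""
    length = 1
    # Skip past all the shorter strings to find which length block we are in.
    while index >= BASE ** length:
        index -= BASE ** length
        length += 1
    chars = []
    for _ in range(length):
        index, digit = divmod(index, BASE)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def seed_words(n_hosts):
    """Return the starting string for each of n_hosts roughly equal slices."""
    return [index_to_word(i * TOTAL // n_hosts) for i in range(n_hosts)]
```

Each participant would start their CGI at their own seed and stop linking onward once they reach the next participant's seed, so the slices don't overlap.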
Ah well..