Recently I’ve been reading an excellent paper: Detecting Near-Duplicates for Web Crawling, by Gurmeet Singh Manku, Arvind Jain and Anish Das Sarma.
What makes the simhash algorithm interesting is its two properties, which the paper describes like this:
Properties of simhash: Note that simhash possesses two conflicting properties: (A) The fingerprint of a document is a “hash” of its features, and (B) Similar documents have similar hash values.
Maybe it’s because of the beauty of the algorithm that I found myself implementing it: https://github.com/leonsim/simhash
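
To make the two properties concrete, here is a minimal sketch of the idea in Python. It is not the code from the repository above, just an illustration under a few assumptions: features are lowercase whitespace-separated words, each feature's weight is how often it occurs, and MD5 truncated to 64 bits stands in for the per-feature hash. The names `feature_hash`, `simhash` and `hamming_distance` are made up for this example.

```python
import hashlib
from collections import Counter


def feature_hash(feature: str, bits: int = 64) -> int:
    """Hash one feature to a `bits`-wide integer (MD5 is an assumption here)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") & ((1 << bits) - 1)


def simhash(text: str, bits: int = 64) -> int:
    """Build a fingerprint from the weighted hashes of the text's features."""
    # Assumption: features are lowercase words, weighted by term frequency.
    weights = Counter(text.lower().split())

    # One running total per bit position.
    totals = [0] * bits
    for feature, weight in weights.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # A set bit votes +weight, a clear bit votes -weight.
            totals[i] += weight if (h >> i) & 1 else -weight

    # The sign of each total decides the corresponding fingerprint bit.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    d1 = simhash("the quick brown fox jumps over the lazy dog")
    d2 = simhash("the quick brown fox jumped over the lazy dog")
    print(hamming_distance(d1, d2))
```

Property (A) shows up in `simhash` being built purely from feature hashes. Property (B) comes from the voting step: changing one word only changes that word's votes, so most bit totals keep their sign, and two documents that share most features tend to end up a small Hamming distance apart.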