The interesting thing about the simhash algorithm is its two properties:
Properties of simhash: Note that simhash possesses two conflicting properties: (A) The fingerprint of a document is a “hash” of its features, and (B) Similar documents have similar hash values.
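To make property (B) concrete, here is a minimal sketch of how a fingerprint of this kind can be computed and compared. It is not the code of the simhash package: the names `toy_simhash` and `hamming_distance` are hypothetical, and using MD5 truncated to 64 bits as the per-feature hash is an assumption made only for illustration.

```python
import hashlib
import re


def get_features(s, width=3):
    """Character n-grams of the lowercased text with non-word characters removed."""
    s = re.sub(r'[^\w]+', '', s.lower())
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]


def toy_simhash(features, bits=64):
    """Illustrative fingerprint: each feature votes +1 or -1 on every bit position."""
    votes = [0] * bits
    mask = (1 << bits) - 1
    for feature in features:
        # Assumption: MD5 truncated to `bits` bits stands in for the per-feature hash.
        h = int(hashlib.md5(feature.encode('utf-8')).hexdigest(), 16) & mask
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is 1 where the positive votes outnumber the negative ones.
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count('1')


base = toy_simhash(get_features('How are you? I am fine. Thanks.'))
near = toy_simhash(get_features('How are u? I am fine. Thanks.'))
far = toy_simhash(get_features('This is simhash test.'))

# The near-duplicate is expected to differ from `base` in fewer bits than `far`.
print(hamming_distance(base, near))
print(hamming_distance(base, far))
```

Because most features of the two similar sentences are shared, most bit votes agree, so the fingerprints should differ in only a few positions, whereas an ordinary hash would change completely.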
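    import re
    from simhash import Simhash

    def get_features(s):
        # Normalize the text, then split it into overlapping 3-character shingles.
        width = 3
        s = s.lower()
        s = re.sub(r'[^\w]+', '', s)
        return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

    # Print each fingerprint in hex; the three similar sentences give similar values.
    print('%x' % Simhash(get_features('How are you? I am fine. Thanks.')).value)
    print('%x' % Simhash(get_features('How are u? I am fine. Thanks.')).value)
    print('%x' % Simhash(get_features('How r you?I am fine. Thanks.')).value)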
Use SimhashIndex to query near-duplicate objects efficiently.
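    import re
    from simhash import Simhash, SimhashIndex

    def get_features(s):
        # Normalize the text, then split it into overlapping 3-character shingles.
        width = 3
        s = s.lower()
        s = re.sub(r'[^\w]+', '', s)
        return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

    data = {
        1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
        2: u'How are you i am fine. blar blar blar blar blar than',
        3: u'This is simhash test.',
    }
    objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
    # k is the tolerance: the maximum number of differing bits for a near duplicate.
    index = SimhashIndex(objs, k=3)

    print(index.bucket_size())

    s1 = Simhash(get_features(u'How are you i am fine. blar blar blar blar blar thank'))
    print(index.get_near_dups(s1))

    # After adding s1 under the id '4', the same query also returns '4'.
    index.add('4', s1)
    print(index.get_near_dups(s1))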
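One common way to make near-duplicate queries like these fast is to split each fingerprint into k + 1 blocks: if two fingerprints differ in at most k bits, at least one block must be identical, so only objects sharing a block need to be compared. The sketch below illustrates that idea under stated assumptions. `ToyIndex` is a hypothetical class written for this example, not the SimhashIndex implementation, and the integer fingerprints are made-up values.

```python
from collections import defaultdict


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count('1')


class ToyIndex:
    """Hypothetical block-based index; a sketch, not the SimhashIndex code."""

    def __init__(self, k=3, bits=64):
        self.k = k
        self.bits = bits
        # Start offsets of the k + 1 blocks; the last block takes any leftover bits.
        self.offsets = [bits // (k + 1) * i for i in range(k + 1)]
        self.buckets = defaultdict(set)

    def _keys(self, value):
        # Yield one (block number, block bits) key per block of the fingerprint.
        for i, offset in enumerate(self.offsets):
            end = self.offsets[i + 1] if i + 1 < len(self.offsets) else self.bits
            block = (value >> offset) & ((1 << (end - offset)) - 1)
            yield (i, block)

    def add(self, obj_id, value):
        for key in self._keys(value):
            self.buckets[key].add((obj_id, value))

    def get_near_dups(self, value):
        found = set()
        for key in self._keys(value):
            # Only candidates sharing at least one block are checked exactly.
            for obj_id, other in self.buckets[key]:
                if hamming_distance(value, other) <= self.k:
                    found.add(obj_id)
        return sorted(found)


index = ToyIndex(k=3)
index.add('a', 0x0123456789ABCDEF)
index.add('b', 0x0123456789ABCDEF ^ 0b111)  # differs from 'a' in 3 bits
index.add('c', 0xFEDCBA9876543210)          # bitwise complement of 'a'

print(index.get_near_dups(0x0123456789ABCDEF))  # ['a', 'b']
```

The trade-off in this design is memory for speed: every object is stored once per block, but a query touches only the buckets matching its own blocks instead of scanning every fingerprint.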