Next Spaceship

A Python Implementation Of Simhash Algorithm

| Comments

Recently I’m reading an exellent paper: Detecting Near-Duplicates for Web Crawling, by Gurmeet Singh Manku, Arvind Jain and Anish Das Sarma.

The interesting of simhash algorithm is its two properties:

Properties of simhash: Note that simhash possesses two conicting properties: (A) The fingerprint of a document is a "hash" of its features, and (B) Similar documents have similar hash values.

Maybe it’s because of the beauty of the algorithm, I find myself implementing it. https://github.com/leonsim/simhash

Getting Started

pip install simhash
(or: easy_install simhash)

View simhash value:

import re
from simhash import Simhash
def get_features(s):
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

print '%x' % Simhash(get_features('How are you? I am fine. Thanks.')).value
print '%x' % Simhash(get_features('How are u? I am fine.     Thanks.')).value
print '%x' % Simhash(get_features('How r you?I    am fine. Thanks.')).value

Get distance of two simhash:

from simhash import Simhash
print Simhash('aa').distance(Simhash('bb'))
print Simhash('aa').distance(Simhash('aa'))

Use the Simhash Index

Use the SimhashIndex to query near duplicates objects in a very efficient way.

import re
from simhash import Simhash, SimhashIndex
def get_features(s):
    width = 3
    s = s.lower()
    s = re.sub(r'[^\w]+', '', s)
    return [s[i:i + width] for i in range(max(len(s) - width + 1, 1))]

data = {
    1: u'How are you? I Am fine. blar blar blar blar blar Thanks.',
    2: u'How are you i am fine. blar blar blar blar blar than',
    3: u'This is simhash test.',
}
objs = [(str(k), Simhash(get_features(v))) for k, v in data.items()]
index = SimhashIndex(objs, k=3)

print index.bucket_size()

s1 = Simhash(get_features(u'How are you i am fine. blar blar blar blar blar thank'))
print index.get_near_dups(s1)

index.add('4', s1)
print index.get_near_dups(s1)

Source Code

View Source Code at GitHub.

Comments