
Can you suggest a good minhash implementation?

I am looking for an open source MinHash implementation that I can use in my work.

The functionality I need is very simple: given a set as input, the implementation should return its MinHash.

A Python or C implementation would be preferred, in case I need to modify it to fit my use case.

Any pointers would be of great help.

Regards.

Atish Kathpal asked Jan 26 '13 03:01


People also ask

How does MinHash work?

A MinHash function converts tokenized text (a set of shingles) into a set of hash integers, then selects the minimum value. This is equivalent to randomly selecting one shingle. The function then repeats the process with n different hash functions, in effect selecting n random shingles.
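The idea of "hash every shingle, keep the minimum" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`shingles`, `seeded_hash`, `min_shingle`) are hypothetical, shingles are taken to be runs of k words, and a salted BLAKE2b digest stands in for "a different hash function per seed".

```python
import hashlib


def shingles(text, k=3):
    """Split text into a set of overlapping k-word shingles (an assumed tokenization)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def seeded_hash(token, seed):
    """Map a token to a 64-bit integer; each seed behaves like a separate hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def min_shingle(text, seed, k=3):
    """Return the shingle whose hash is smallest under the hash function chosen by seed."""
    return min(shingles(text, k), key=lambda s: seeded_hash(s, seed))


doc = "the quick brown fox jumps over the lazy dog"
# Each seed picks one shingle; different seeds may pick different shingles.
picked = [min_shingle(doc, seed) for seed in range(4)]
```

Because the hash values are effectively random, the shingle with the smallest hash is in effect a random draw from the set, which is why repeating this with n seeds behaves like selecting n random shingles.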

How is MinHash signature calculated?

By computing many such MinHash values and counting how often they collide between two sets, we can efficiently estimate J(A, B) without explicitly computing the intersection and union. To compute one MinHash value of a set A = {a1, a2, ...}, draw a hash function U from a universal family, compute the hash values U(A) = {U(a1), U(a2), ...}, and take the minimum of U(A). Repeating this with k independent hash functions gives a MinHash signature of length k.
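A minimal sketch of that computation, which also matches what the question asks for (set in, MinHash signature out), might look like the following. The names `make_hash_params` and `minhash_signature` are hypothetical, and the hash family used here, (a*x + b) mod p with a Mersenne prime p applied to a SHA-1 based integer for each item, is one common choice rather than the method of any particular library.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1  # modulus for the (a*x + b) mod p hash family


def base_hash(item):
    """Map an arbitrary item to a stable 64-bit integer via its string form."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def make_hash_params(num_hashes, seed=1):
    """Draw (a, b) pairs, each defining one hash function x -> (a*x + b) mod p."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_hashes)
    ]


def minhash_signature(items, params):
    """Return one minimum hash value per hash function for a non-empty set."""
    if not items:
        raise ValueError("cannot compute a MinHash signature of an empty set")
    hashed = [base_hash(x) for x in items]
    return [min((a * h + b) % MERSENNE_PRIME for h in hashed) for a, b in params]


params = make_hash_params(num_hashes=64)
sig = minhash_signature({"apple", "banana", "cherry"}, params)
```

Signatures are only comparable when they are built with the same `params`, so the parameters should be generated once and reused for every set.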

How do you calculate MinHash?

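The probability that two sets have the same MinHash value is the number of items they have in common divided by the total number of distinct items across both sets, which is exactly the Jaccard similarity. For example, with 3 common items out of 10 in total, it is 3/10. This holds because every item is equally likely to produce the minimum hash, so the probability that the minimum comes from one of the shared items equals the Jaccard similarity.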
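The 3-out-of-10 example can be checked numerically. The sketch below is an illustration only: the two sets are made up so that they share 3 of 10 distinct items, the function names are hypothetical, and salted BLAKE2b digests are assumed as the family of hash functions.

```python
import hashlib


def seeded_hash(item, seed):
    """Hash an item to a 64-bit integer; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    data = str(item).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8, salt=salt).digest(), "big")


def signature(items, num_hashes=256):
    """MinHash signature: the minimum hash of the set under each hash function."""
    return [min(seeded_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two MinHash values agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


A = {"a", "b", "c", "d", "e", "f"}
B = {"a", "b", "c", "g", "h", "i", "j"}

exact = exact_jaccard(A, B)  # 3 shared items out of 10 distinct items
estimate = estimate_jaccard(signature(A), signature(B))
print(f"exact={exact:.3f} estimate={estimate:.3f}")
```

The estimate is a sample average, so it fluctuates around the exact value; using more hash functions narrows the error at the cost of longer signatures.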


1 Answer

Have a look at the following open source libraries, in this order. All of them are written in Python and show how to calculate document similarity using LSH/MinHash:

lsh
LSHHDC : Locality-Sensitive Hashing based High Dimensional Clustering
MinHash

Nilesh answered Oct 21 '22 22:10