I am looking for an open-source MinHash implementation that I can use in my work.
The functionality I need is very simple: given a set as input, the implementation should return its MinHash.
A Python or C implementation would be preferred, in case I need to hack it to work for me.
Any pointers would be of great help.
Regards.
A MinHash function converts tokenized text into a set of integer hashes, then selects the minimum value. This is equivalent to randomly selecting a token. The function then repeats this with n different hash functions, in effect selecting n random tokens (shingles).
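The description above can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any particular library: the names `_hash` and `minhash_signature` are made up for this example, and salting BLAKE2b with a seed is an assumed stand-in for "different hashing functions".

```python
import hashlib

def _hash(token: str, seed: int) -> int:
    """Hypothetical helper: a 64-bit hash of `token`, salted with `seed`.

    Each seed acts as a separate hash function.
    """
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def minhash_signature(tokens, num_hashes=64):
    """Return one minimum hash value per hash function."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot compute the MinHash of an empty set")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]

# Usage: the input is a set, so token order does not matter.
sig = minhash_signature({"the", "quick", "brown", "fox"}, num_hashes=4)
print(sig)
```

With `num_hashes=1` this gives the single MinHash value the question asks for; larger values give a signature that can be compared against other sets.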
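By computing many such MinHash values and counting the number of collisions, we can efficiently estimate the Jaccard similarity J(A, B) without explicitly computing the intersection and union of the sets. To compute a MinHash signature of a set A = {a1, a2, ...}, generate a universal hash function U, compute the set of hash values U(A) = {U(a1), U(a2), ...}, and take the minimum of U(A) as the MinHash value. Repeating this with several independent hash functions gives the signature.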
The probability that a given MinHash value will come from one of the shared items is equal to the Jaccard similarity. With 3 common items out of 10 distinct items in total, that probability is 3/10, which is the same as the Jaccard similarity.
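To make the 3/10 case concrete, here is a rough sketch that estimates the Jaccard similarity by counting how often two signatures collide. The sets `A` and `B` are invented for illustration (3 shared items, 10 distinct items overall), the helper names are hypothetical, and seeded BLAKE2b is assumed as a substitute for a family of universal hash functions.

```python
import hashlib

def salted_hash(item, seed):
    """Hypothetical helper: 64-bit hash of `item` under hash function `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def signature(items, num_hashes):
    """MinHash signature: the minimum hash value under each hash function."""
    return [min(salted_hash(x, s) for x in items) for s in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures hold the same value."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    collisions = sum(x == y for x, y in zip(sig_a, sig_b))
    return collisions / len(sig_a)

# Illustrative data: 3 shared items (e, f, g) out of 10 distinct items.
A = {"a", "b", "c", "d", "e", "f", "g"}
B = {"e", "f", "g", "h", "i", "j"}

exact = len(A & B) / len(A | B)
estimate = estimate_jaccard(signature(A, 1024), signature(B, 1024))
print(exact, estimate)
```

The estimate is a random quantity, so it only approaches the exact value as the number of hash functions grows; 1024 is an arbitrary choice here.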
You should have a look at the following open source libraries, in order. All of them are in Python, and show how you can calculate document similarity using LSH/MinHash:
lsh
LSHHDC : Locality-Sensitive Hashing based High Dimensional Clustering
MinHash