 

Locality Sensitive Hash Implementation? [closed]


Are there any relatively simple to understand (and simple to implement) locality-sensitive hash examples in C/C++/Java/C#?

I'd like to learn more about the concept, so I want to try an implementation on a few text files just to see how it works. I don't need anything high-performance, just an example of a hash function that returns similar hashes for similar inputs. I can learn more from the example afterwards. :)

user541686 asked Apr 24 '11 10:04




1 Answer

For strings, you can use an approximate string-matching algorithm.

  • Generate a random reference string.
  • For every input string, compute its distance from that shared reference string using an edit-distance algorithm such as Levenshtein distance: http://www.dotnetperls.com/levenshtein

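If two strings are at about the same distance from the reference string, then chances are that they are similar to each other. The guarantee only runs one way: by the triangle inequality, strings that are close to each other always have close distances to the reference, but unrelated strings can also end up at the same distance. And there you go, you have a locality-sensitive hash implementation for strings.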

You can create a separate hash bucket for each range of distances.
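The steps above could be sketched roughly as follows. This is only a minimal illustration in Python (the same logic ports directly to C++, Java or C#). The names `make_reference` and `lsh_bucket`, the reference length of 32, the fixed seed and the bucket width of 4 are all assumptions made for the example, not part of the original suggestion.

```python
import random
import string


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def make_reference(length: int = 32, seed: int = 0) -> str:
    """Hypothetical helper: a random reference string shared by all inputs.

    The seed is fixed so that every input is hashed against the same string.
    """
    rng = random.Random(seed)
    alphabet = string.ascii_lowercase + " "
    return "".join(rng.choice(alphabet) for _ in range(length))


def lsh_bucket(text: str, reference: str, bucket_width: int = 4) -> int:
    """Map a string to a bucket index based on its distance to the reference.

    Each bucket covers a range of `bucket_width` consecutive distances
    (an assumed parameter; tune it for your data).
    """
    return levenshtein(text, reference) // bucket_width


if __name__ == "__main__":
    ref = make_reference()
    for s in ["the quick brown fox", "the quick brown fax", "zzzz"]:
        print(repr(s), "-> bucket", lsh_bucket(s, ref))
```

Strings that differ by a single edit land in the same or an adjacent bucket, because their distances to the reference differ by at most one. Strings in the same bucket are only candidates for being similar, so a real use would still compare them directly.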

EDIT: You can try other variations of string distance. A simpler algorithm would just return the number of characters the two strings have in common.
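One possible reading of "common characters" is a multiset intersection, where a character that appears twice in both strings counts twice. The sketch below assumes that interpretation; the function name `common_chars` is made up for the example.

```python
from collections import Counter


def common_chars(a: str, b: str) -> int:
    """Count the characters shared by a and b, respecting multiplicity.

    Assumption: "common" means multiset intersection, so position and
    order are ignored and anagrams score as high as identical strings.
    """
    shared = Counter(a) & Counter(b)  # per-character minimum of the counts
    return sum(shared.values())


if __name__ == "__main__":
    print(common_chars("hello", "help"))
    print(common_chars("abc", "xyz"))
```

Note that this is a similarity score rather than a distance: larger values mean more overlap, so buckets built from it would group strings by how much they share with the reference string.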

Muhammad Hasan Khan answered Oct 04 '22 22:10
