Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

fast similarity detection

I have a large collection of objects and I need to figure out the similarities between them.

To be exact: given two objects I can compute their dissimilarity as a number, a metric - higher values mean less similarity and 0 means the objects have identical contents. The cost of computing this number is proportional to the size of the smaller object (each object has a given size).

I need the ability to quickly find, given an object, the set of objects similar to it.

To be exact: I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d, such that listing the objects in the set takes no more time than if they were in an array or linked list (and perhaps they actually are). Typically, the set will be very much smaller than the total number of objects, so it is really worthwhile to perform this computation. It's good enough if the data structure assumes a fixed d, but if it works for an arbitrary d, even better.

Have you seen this problem before, or something similar to it? What is a good solution?

To be exact: a straightforward solution involves computing the dissimilarities between all pairs of objects, but this is slow - O(n2) where n is the number of objects. Is there a general solution with lower complexity?

like image 402
reinierpost Avatar asked Dec 11 '09 16:12

reinierpost


1 Answers

I need to produce a data structure that maps any object o to the set of objects no more dissimilar to o than d, for some dissimilarity value d.

It might be fastest to just abandon the similarity computation when the subtotal becomes larger than d. For example, if your similarities are based on cosine or hausdorff distances this can easily be done.

 

PS: if this cannot be done, your problem might be related to the k-nearest neighbors problem (or more precise a nearest neighbor problem with a threshold neighborhood). You should look for algorithms that find close-by members without computing all distances (maybe something using triangle inequality). Wikipedia should help you to explore suitable algorithms.

like image 74
akuhn Avatar answered Sep 27 '22 19:09

akuhn