 

Efficient comparison of 100,000 vectors

I store 100,000 vectors in a database. Each vector has dimension 60 (int vector[60]).

Then I pick one vector and want to present the remaining vectors to the user in order of decreasing similarity to the chosen one.

I use the Tanimoto coefficient to compare two vectors:

T(A, B) = (A · B) / (|A|² + |B|² − A · B)
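For reference, a minimal sketch of that coefficient (assuming the standard continuous Tanimoto form above; Python with NumPy, names illustrative):

import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """T(A, B) = A.B / (|A|^2 + |B|^2 - A.B)"""
    dot = float(np.dot(a, b))
    return dot / (float(np.dot(a, a)) + float(np.dot(b, b)) - dot)

# Brute force over every entry is exactly what I would like to avoid:
# top20 = sorted(range(len(db)), key=lambda i: tanimoto(chosen, db[i]), reverse=True)[:20]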

Is there any method to avoid going through all the entries in the database?

One more thing! I don't need to sort all the vectors in the database. I want to get the top 20 most similar vectors. So maybe we could roughly threshold out 60% of the entries and sort only the rest. What do you think?

asked Jun 15 '09 by user101375

2 Answers

First, preprocess your vector list so that each vector is normalized to unit magnitude. Notice that the magnitude terms in your comparison function T() then become constant, and the formula simplifies to finding the largest dot product between your test vector and the vectors in the database (with |A| = |B| = 1, T(A, B) = A · B / (2 − A · B), which increases monotonically with the dot product).

Now, think of a new function D, the distance between two points in 60-D space. This is the classic L2 distance: take the difference of each component, square each, add all the squares, and take the square root of the sum. D(A, B) = sqrt((A − B) · (A − B)), where A and B are each 60-dimensional vectors.

This can be expanded, though, to D(A, B) = sqrt(A · A − 2 · dot(A, B) + B · B). Since A and B are unit magnitude, A · A and B · B are constant, and sqrt() is monotonic, so removing it and looking at squared distances won't change the sort order. This leaves us with only −2 · dot(A, B). Thus, minimizing distance is exactly equivalent to maximizing the dot product.
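A quick numerical check of that equivalence (a sketch with NumPy and randomly generated stand-in data, not the poster's actual vectors):

import numpy as np

rng = np.random.default_rng(0)
db = rng.normal(size=(100_000, 60))
db /= np.linalg.norm(db, axis=1, keepdims=True)    # normalize to unit magnitude
query = rng.normal(size=60)
query /= np.linalg.norm(query)

dots = db @ query                                  # dot product against every row
sq_dists = np.sum((db - query) ** 2, axis=1)       # squared L2 distances

# For unit vectors, |A - B|^2 = 2 - 2 * (A . B), so the two orderings agree:
print(np.allclose(sq_dists, 2.0 - 2.0 * dots))     # True
print(np.argsort(-dots)[:20])                      # top 20 by dot product
print(np.argsort(sq_dists)[:20])                   # the same neighbors, ranked by distance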

So the original T() classification metric can be simplified to finding the highest dot product between the normalized vectors. And that comparison has been shown to be equivalent to finding the closest points to the sample point in 60-D space.

So now all you need to do is solve the equivalent problem of "given a normalized point in 60D space, list the 20 points in the database of normalized sample vectors which are nearest to it."

That problem is a well-understood one: it's K nearest neighbors. There are many algorithms for solving this; the most common is the classic KD tree.

But there's a problem. KD trees have O(e^D) behavior: high dimensionality quickly becomes painful. And 60 dimensions is definitely in that extremely painful category. Don't even try it.

There are, however, several alternative general techniques for high-D nearest neighbor search. This paper gives a clear method.

But in practice, there's a great solution involving yet another transform. If you have a metric space (which you do, or you wouldn't be using the Tanimoto comparison), you can reduce the dimensionality of the problem with a 60-dimensional rotation. That sounds complex and scary, but it's very common: it's a form of singular value decomposition, or eigenvalue decomposition. In statistics, it's known as Principal Components Analysis.

Basically this uses a simple linear computation to find what directions your database really spans. You can collapse the 60 dimensions down to a lower number, perhaps as low as 3 or 4, and still be able to accurately determine nearest neighbors. There are tons of software libraries for doing this in any language; see here for example.
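As an illustration (not the answer's original code), a minimal PCA-style reduction via NumPy's SVD; the choice of 5 output dimensions is just an example:

import numpy as np

def pca_reduce(data: np.ndarray, n_components: int = 5):
    """Project rows of `data` (normalized 60-D vectors) onto their top principal directions."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                     # (n_components, 60)
    explained = float((s[:n_components] ** 2).sum() / (s ** 2).sum())
    reduced = centered @ components.T                  # (n_rows, n_components)
    return reduced, components, mean, explained

# reduced, components, mean, explained = pca_reduce(db, 5)
# `explained` is the fraction of variance kept, i.e. the residual-error check mentioned below.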

Finally, you'll do a classic K nearest neighbors search in probably only 3-10 dimensions; you can experiment for the best behavior. There's a terrific library for doing this called Ranger, but you can use other libraries as well. A great side benefit is that you don't even need to store all 60 components of your sample data any more!
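The answer suggests Ranger, whose exact API isn't reproduced here; as a stand-in, the reduced-dimension query step might look like this with SciPy's KD-tree, continuing from the sketches above (reduced, components, mean, and query come from there):

from scipy.spatial import cKDTree

tree = cKDTree(reduced)                          # build once over the reduced database
query_reduced = (query - mean) @ components.T    # project the chosen vector the same way
dists, idx = tree.query(query_reduced, k=20)     # indices of the 20 nearest samples

Because the projection is approximate, a common refinement is to fetch more candidates than needed (say k = 100) and re-rank just those with the exact Tanimoto score on the original 60-D vectors.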

The nagging question is whether your data really can be collapsed to lower dimensions without affecting the accuracy of the results. In practice, the PCA decomposition can tell you the maximum residual error for whatever D limit you choose, so you can be assured it works. Since the comparison points are based on a distance metric, it's very likely they are intensely correlated, unlike say hash table values.

So the summary of the above:

  1. Normalize your vectors, transforming your problem into a K-nearest neighbor problem in 60 dimensions
  2. Use Principal Components Analysis to reduce dimensionality down to a manageable limit of say 5 dimensions
  3. Use a K Nearest Neighbor algorithm such as Ranger's KD tree library to find nearby samples.
answered Sep 23 '22 by SPWorley


Update:

After you made it clear that 60 is the dimension of your space, not the length of the vectors, the answer below is not applicable to you, so I'll keep it just for posterity.


Since your vectors are normalized, you can employ a kd-tree to find the neighbors within an MBH of incremental hypervolume.

No database I'm aware of has native support for kd-trees, so you can try to implement the following solution in MySQL if you are searching for a limited number of closest entries:

  • Store the projections of the vectors onto each possible 2-dimensional subspace (this takes n * (n - 1) / 2 columns)
  • Index each of these columns with a SPATIAL index
  • Pick a square MBR of a given area within any projection. The product of these MBRs will give you a hypercube of limited hypervolume, which will hold all vectors with a distance not greater than a given one.
  • Find all projections within all MBRs using MBRContains

You'll still need to sort within this limited range of values.

For instance, say you have a set of 4-dimensional vectors with a magnitude of 2:

(2, 0, 0, 0)
(1, 1, 1, 1)
(0, 2, 0, 0)
(-2, 0, 0, 0)

You'll have to store them as follows:

p12  p13  p14  p23  p24  p34
---  ---  ---  ---  ---  ---
2,0  2,0  2,0  0,0  0,0  0,0
1,1  1,1  1,1  1,1  1,1  1,1
0,2  0,0  0,0  2,0  2,0  0,0
-2,0 -2,0 -2,0 0,0  0,0  0,0

Say you want the similarity with the first vector (2, 0, 0, 0) to be greater than 0.

This means having the vectors inside the hypercube: (0, -2, -2, -2):(4, 2, 2, 2).

You issue the following query:

SELECT  *
FROM    vectors
WHERE   MBRContains(LineFromText('LineString(0 -2, 4 2)'), p12)
        AND MBRContains(LineFromText('LineString(0 -2, 4 2)'), p13)
        …

and so on for all six columns.
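For what it's worth, a rough sketch (Python, purely illustrative and not part of the original answer) of how the pairwise projections and the query box could be computed before being stored in or sent to MySQL:

from itertools import combinations

def pairwise_projections(v):
    # One (v[i], v[j]) pair per pij column
    return {f"p{i + 1}{j + 1}": (v[i], v[j]) for i, j in combinations(range(len(v)), 2)}

def query_box(center, half_width):
    # Corner points of the square MBR used with MBRContains for one projection
    x, y = center
    return (x - half_width, y - half_width), (x + half_width, y + half_width)

print(pairwise_projections((2, 0, 0, 0)))
# {'p12': (2, 0), 'p13': (2, 0), 'p14': (2, 0), 'p23': (0, 0), 'p24': (0, 0), 'p34': (0, 0)}
print(query_box((2, 0), 2))
# ((0, -2), (4, 2)) -- the box used in the example query above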

answered Sep 21 '22 by Quassnoi