I'm having trouble coming up with an efficient SQL query to handle the following situation:
Assume we have a table with two columns:
groupId : int
value : float
The table is huge (several million rows). The number of values per groupId varies, say somewhere between 100 and 50,000. All float values are greater than or equal to zero but are otherwise unbounded.
For a given groupId the query should return all other groups sorted by decreasing similarity, where "similar" is defined as the minimum Euclidean distance between all possible pairs of 30 values in two groups.
That definition of similarity is what kills me. I think the naive algorithm for calculating similarity as defined above is O(n^2). Now I'm looking for ideas to either redefine "similarity" or implement the above efficiently. I could imagine a solution involving k-nearest neighbours, something like PostGIS geometrical nearest neighbours, or maybe a longest common subsequence algorithm (although I'd need a "fuzzy" implementation of the latter because "values" will hardly ever compare exactly equal).
We are currently on MySQL in case it matters.
cheers,
Sören
Here are some nice approximations:
You could calculate the center of mass of each group and then compare groups by the distance between their centers of mass.
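A rough sketch of the center-of-mass idea, assuming the rows have already been loaded into a dict that maps each groupId to its list of values (the function name and the sample data are made up for illustration). Since each row carries a single float, the center of mass of a group is simply the mean of its values:

```python
from statistics import fmean

def rank_by_centroid(groups, target_id):
    """Return the other group ids, ordered by centroid distance (closest first)."""
    # With one float per row, the centroid of a group is just its mean value.
    centroids = {gid: fmean(values) for gid, values in groups.items()}
    target = centroids[target_id]
    others = (gid for gid in centroids if gid != target_id)
    return sorted(others, key=lambda gid: abs(centroids[gid] - target))

# Hypothetical sample data: groupId -> values
groups = {1: [0.5, 1.5, 2.5], 2: [1.0, 2.0, 3.5], 3: [40.0, 50.0]}
print(rank_by_centroid(groups, 1))
```

In MySQL the per-group means could probably be precomputed with AVG(value) and GROUP BY groupId, so that only one number per group has to be compared at query time.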
Another way is to hash the coordinates of each row: rows that hash to the same location are considered similar, and the similarity score of the two groups they belong to is increased accordingly.
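A minimal sketch of the hashing idea, assuming the "hash" is nothing more than a fixed-width bucket over the value (the bucket width, the function name and the sample data are all assumptions). Each bucket that two groups share adds one to their similarity score:

```python
from collections import Counter, defaultdict
from math import floor

def rank_by_shared_buckets(groups, target_id, width=1.0):
    """Score every other group by the number of buckets it shares with the target."""
    # bucket index -> set of group ids that have at least one value in it
    buckets = defaultdict(set)
    for gid, values in groups.items():
        for v in values:
            buckets[floor(v / width)].add(gid)

    scores = Counter()
    for members in buckets.values():
        if target_id in members:
            for gid in members:
                if gid != target_id:
                    scores[gid] += 1
    # Highest score first; groups sharing no bucket with the target are absent.
    return scores.most_common()

# Hypothetical sample data: groupId -> values
groups = {1: [0.2, 1.7, 5.1], 2: [0.9, 5.8, 9.0], 3: [1.1, 20.0]}
print(rank_by_shared_buckets(groups, 1))
```

One caveat with this simple bucketing: two values that are very close but sit on opposite sides of a bucket boundary are not counted as similar, so the width needs some tuning.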
Some more information would be helpful such as:
Is the information constantly being updated, and if so, at what interval? How up to date and how accurate does it need to be?
Could you verify that I got the question right?
Your table represents vectors identified by the groupId. Every vector has a dimension of somewhere between 100 and 50,000, but there is no order defined on the dimensions. That is, a vector from the table is actually a representative of an equivalence class.
Now you define the similarity of two equivalence classes as the minimum Euclidean distance between the projections of any two representatives of the equivalence classes onto the subspace of the first 30 dimensions.
Example for projection onto two dimensions:
A = <1, 2, 3, 4>
B = <5, 6, 7, 8, 9, 10>
A represents the following equivalence class of vectors.
<1, 2, 3, 4> <2, 1, 3, 4> <3, 1, 2, 4> <4, 1, 2, 3>
<1, 2, 4, 3> <2, 1, 4, 3> <3, 1, 4, 2> <4, 1, 3, 2>
<1, 3, 2, 4> <2, 3, 1, 4> <3, 2, 1, 4> <4, 2, 1, 3>
<1, 3, 4, 2> <2, 3, 4, 1> <3, 2, 4, 1> <4, 2, 3, 1>
<1, 4, 2, 3> <2, 4, 1, 3> <3, 4, 1, 2> <4, 3, 1, 2>
<1, 4, 3, 2> <2, 4, 3, 1> <3, 4, 2, 1> <4, 3, 2, 1>
The projection of all representatives of this equivalence class onto the first two dimensions yields:
<1, 2> <1, 3> <1, 4>
<2, 1> <2, 3> <2, 4>
<3, 1> <3, 2> <3, 4>
<4, 1> <4, 2> <4, 3>
B represents an equivalence class with 720 elements. The projection onto the first two dimensions yields 30 elements:
< 5, 6> < 5, 7> < 5, 8> < 5, 9> < 5, 10>
< 6, 5> < 6, 7> < 6, 8> < 6, 9> < 6, 10>
< 7, 5> < 7, 6> < 7, 8> < 7, 9> < 7, 10>
< 8, 5> < 8, 6> < 8, 7> < 8, 9> < 8, 10>
< 9, 5> < 9, 6> < 9, 7> < 9, 8> < 9, 10>
<10, 5> <10, 6> <10, 7> <10, 8> <10, 9>
So the distance between A and B is the square root of 8, because this is the minimum distance between any two vectors from the projections. For example, <3, 4> and <5, 6> yield this distance.
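The small example can be checked by brute force. This is only a sketch for tiny inputs (the function name is made up, and math.dist needs Python 3.8 or later); it enumerates every ordered selection of k components from both vectors and keeps the smallest Euclidean distance:

```python
from itertools import permutations
from math import dist

def min_projection_distance(a, b, k):
    """Smallest Euclidean distance between any k-component projections of a and b."""
    # permutations(x, k) yields every ordered selection of k components of x,
    # i.e. the first k dimensions of every reordering of the vector.
    return min(dist(p, q) for p in permutations(a, k) for q in permutations(b, k))

A = [1, 2, 3, 4]
B = [5, 6, 7, 8, 9, 10]
print(min_projection_distance(A, B, 2))
```

The exhaustive enumeration is exactly what makes this approach unusable for real group sizes.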
So, am I right with my understanding of the problem?
A really naive algorithm for n vectors with m components each would have to calculate (n - 1) distances. For each distance, the algorithm would have to compare the m! / (m - 30)! projections of each vector. For 100 dimensions (your lower bound) there are about 7.8*10^57 possible projections per vector. Finding the distance between two such vectors therefore requires calculating about 6*10^115 distances between projections and taking the minimum. And then repeat this for every other vector.
I hope that I misunderstood you or made a mistake. Otherwise this sounds like something between really challenging and not feasible.
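To get a feeling for how quickly the number of ordered 30-component projections grows, the count m! / (m - 30)! can be evaluated exactly with integer arithmetic. A small sketch using math.perm (Python 3.8 or later); the function name is only for illustration, and the pair count assumes both vectors have 100 components:

```python
from math import perm

def projection_count(m, k=30):
    """Number of ordered selections of k out of m components: m! / (m - k)!."""
    return perm(m, k)

per_vector = projection_count(100)
print(f"projections per vector: {per_vector:.3e}")
# Every projection of one vector is compared with every projection of the other.
print(f"projection pairs for two vectors: {per_vector ** 2:.3e}")
```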
Something I thought about is ordering the vector components and trying to match them. Using Manhattan distance - if possible - may help to simplify the solution.
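One way this idea might be made concrete, purely as a sketch and not a worked-out solution: with Manhattan distance the problem becomes picking k values from each group and pairing them so that the sum of absolute differences is minimal. For values on a line, an optimal pairing can always be chosen so that it does not cross once both groups are sorted, which allows a dynamic program over the sorted values. The function name is made up:

```python
def min_manhattan_k_matching(a, b, k):
    """Minimum sum of |x - y| over k disjoint pairs (x from a, y from b)."""
    a, b = sorted(a), sorted(b)
    inf = float("inf")
    # prev[i][j]: cheapest way to form t - 1 pairs from a[:i] and b[:j].
    # For t - 1 == 0 that cost is 0 everywhere.
    prev = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for _ in range(k):
        cur = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cur[i][j] = min(
                    cur[i - 1][j],  # leave a[i - 1] unused
                    cur[i][j - 1],  # leave b[j - 1] unused
                    prev[i - 1][j - 1] + abs(a[i - 1] - b[j - 1]),  # pair them
                )
        prev = cur
    # inf means that k pairs cannot be formed at all.
    return prev[len(a)][len(b)]

print(min_manhattan_k_matching([1, 2, 3, 4], [5, 6, 7, 8, 9, 10], 2))
```

This runs in O(k * |a| * |b|) time, which is far better than enumerating projections but still heavy for groups with 50,000 values, so it would most likely have to be combined with some pre-filtering.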