Given array of bitstrings (all of the same length) and query string Q find top-k most similar strings to Q, where similarity between strings A and B is defined as number of 1 in A and B, (operation and is applied bitwise).
I think there is should be a classical result for this problem.
k is small, in hundreds, while number of vectors in hundreds of millions and length of the vectors is 512 or 1024
One way to tackle this problem is to construct a K-Nearest Neighbor Graph (K-NNG) (digraph) with a Russell-Rao similarity function.
Note that efficient K-NNG construction is still an open problem,and none of the known solutions for this problem is general, efficient and scalable [quoting from Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures - Dong, Charikar, Li 2011].
Your distance function is often called Russell-Rao similarity (see for example A Survey of Binary Similarity and Distance Measures - Choi, Cha, Tappert 2010). Note that Russell-Rao similarity is not a metric (see Properties of Binary Vector Dissimilarity Measures - Zhang, Srihari 2003): The "if" part of "d(x, y) = 0 iff x == y" is false.
In A Fast Algorithm for Finding k-Nearest Neighbors with Non-metric Dissimilarity - Zhang, Srihari 2002, the authors propose a fast hierarchical search algorithm to find k-NNs using a non-metric measure in a binary vector space. They use a parametric binary vector distance function D(β). When β=0, this function is reduced to the Russell-Rao distance function. I wouldn't call it a "classical result", but this is the the only paper I could find that examines this problem.
You may want to check these two surveys: On nonmetric similarity search problems in complex domains - Skopal, Bustos 2011 and A Survey on Nearest Neighbor Search Methods - Reza, Ghahremani, Naderi 2014. Maybe you'll find something I missed.
This problem can be solved by writing simple Map and Reduce job. I'm neither claiming that this is the best solution, nor I'm claiming that this is the only solution.
Also, you have disclosed in the comments that k is in hundreds, there are millions of bitstrings and that the size of each of them is 512 or 1024.
Now, the combiner can consolidate the list of all bitStrings from every mapper that have the same similarity.
From the output of reducer you can extract the top-k bitstrings.
So, MapReduce paradigm is probably the classical solution that you are looking for.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With