I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches. This means, if the two documents have the following numbers. Yes, this is Hamming Distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between the documents. that makes the time to do this basically impossible.
I have a LOT of RAM. And, in ruby, there seems to be a great gem (algorithms) that can create a number of trees, none of which I can seem to make work (yet) that would reduce the number of queries I'd need to make.
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up doing a retrieval of all the documents into memory.. (subset with the id and the string).
Then, I used a BK Tree to compare the strings.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With