Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O^2 issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:

0110101010101010110101010101

Ideally, I'd like to query for near binary matches. This means, if the two documents have the following numbers. Yes, this is Hamming Distance.

This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.

So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between the documents. that makes the time to do this basically impossible.

I have a LOT of RAM. And, in ruby, there seems to be a great gem (algorithms) that can create a number of trees, none of which I can seem to make work (yet) that would reduce the number of queries I'd need to make.

Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.

Anyone's thoughts would be appreciated.

like image 899
Williamf Avatar asked Jan 04 '12 21:01

Williamf


1 Answers

I ended up doing a retrieval of all the documents into memory.. (subset with the id and the string).

Then, I used a BK Tree to compare the strings.

like image 113
Williamf Avatar answered Sep 22 '22 23:09

Williamf