Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which data structure to store binary strings and query with hamming distane

I'm looking for a data structure to handle bilions of binary strings that contains 512 binary values.

My goal is to send querys to the structure and get a resultset which contains all data that are lower a distance.

My first idea was to use a kd tree. But those trees are very slow for a high dimension. My second idea is to use a lsh approach (minHash/ superbit) lsh. But for that I must also have a structure to perform efficient search

Any ideas how to handle these big data?

**Update ** Some detail notes :

  • for the hamming distance exists only a upper limit that is maybe 128. But in time I doesn't know the upper limit
  • a insertion or a deletion would be nice but I also can rebuild the graph (the data base only updated onces a week)
  • the result set must contain all relevant nodes (I'm not looking for knn)
like image 758
501 - not implemented Avatar asked Feb 24 '16 13:02

501 - not implemented


1 Answers

Without knowing your intended search parameters, it's hard to be too optimized. That said, I think a good approach would be to build a B- or T- tree and then optimize that structure for the binary nature of the data.

Specifically, you have 64 bytes of data as a 512 element bit-string. Your estimate is that you will have "bilions" of records. That's on the order of 232 values, so 1/16th of the space will be full? (Does this agree with your expectations?)

Anyway, try breaking the data into bytes, let each byte be a key level. You can probably compress the level records, if the probability of set bits is uniform. (If not, if say set bits are more likely at the front of the key, then you might want to just allocate 256 next-level pointers and have some be null. It's not always worth it.)

All of your levels will be the same- they will represent 8 more bits of the string. So compute a table that maps, for a byte, all the byte values that are within distance S from that byte, 0 <= S <= 8. Also, compute a table that maps two bytes to the distance E between them, hamming(a,b).

To traverse the tree, let your search distance be SD. Set D = SD. Read the top level block. Find all 8-bits values in the block less than distance min(8, D) from your query. For each value, compute the exact distance hamming(query, value) and recurse to the lower block with D = D - hamming(query, value) for that sub-tree.

like image 163
aghast Avatar answered Sep 28 '22 08:09

aghast