I have an array of 256-bit values. The array is huge (millions of records) but it is modified just rarely and it fits to the memory. For a given 256-bit number I want to find whether there exists a record which has at least N bits equal. For instance, 10000 and 01111 have 0 bits equal, 1000 and 1001 have 3 bits equal. Always N > 128, or rather N > 140. I don't need to find a particular number, I just need to find whether such number exists in the list or not.
Is there a type of data structure or some kind of index which could somehow speed up the searching?
I may be wrong, but it looks like you can index your data as bitwise trie, which for your case is identical to binary tree in structure, but is interpreted differently. Here's illustration (source):
For the sake of simplicity, let's think about your 256-bit numbers as about vectors with 256 numbers (also binary, of course). Having a tree like above, you can find whether particular vector exists in your dataset in 256 steps or less by just walking down the tree and testing if subsequent branch (either "0" or "1") exists. Something like this (code's quite raw, but you should get an idea):
def exists(node, v, i):
"""Checks if subvector of v starting from index i exists in
sub-tree defined by node
"""
if i == len(v):
return True
elif v[i] == 0 and node.left:
return exists(node.left, v, i + 1)
elif v[i] == 1 and node.right:
return exists(node.right, v, i + 1)
else:
return False
At each step we decide where to go left or right branch based on current value element of v
. But you also need to process up to K different elements. Ok, so why not to use error count and process both branches?
def exists(node, v, i, err_left=K):
"""Checks if subvector of v starting from index i exists in
sub-tree defined by node with up to err_left errors.
"""
if i == len(v):
return True
elif v[i] == 0:
if node.left:
return exists(node.left, v, i + 1, err_count)
elif node.right:
return exists(node.left, v, i + 1, err_left - 1) # proceed, but decrease
# errors left
else:
return False
elif v[i] == 1:
if node.right:
return exists(node.right, v, i + 1, err_count)
elif node.left:
return exists(node.right, v, i + 1, err_left - 1) # proceed, but decrease
# errors left
else:
return False
else:
return False
Running time of this algorithm heavily depends on your settings. In worst case (K = 256) it is still O(n) (you need to check each branch), but this time falls quickly as K decreases (with small K it is almost O(log n)). With K ~ 128 it's somewhere in the middle.
You may also want to rewrite function so that it checked good-day (no errors) scenarios first (e.g. if you expect number of errors be small on average).
Trie compressed data in a certain sense, but if you have issues with memory, try to use table representation instead of objects and pointers.
Finally, you can combine it with bloom filters to filter numbers that definitely are not in a set first, and then check the rest with above tree.
The algorithm to solve that is O(n). Your only option is looping through the array until finding a number that matches with your target.
Now, how do we find if two numbers match?. The most efficient way is using bitwise operations. For instance I write a code that works for 64 bits long numbers:
int compare(long l1, long l2)
{
//Use xor to get what bits are the same
long xor = ~ (l1 ^ l2);
//count the number of bits with value = 1
int count = 0;
while(xor != 0)
{
//check if the right bit is 1, using & operator beetween our number and 0000001
if(xor & 1 > 0) count++;
//shift right the bits
xor = xor >> 1
}
return count;
}
This comparation can be adapted, depending on how your 256 bits numbers are implemented.
An optimization could be to break the while loop when you reach count >= N
.
You can check this question to look for more efficient ways to count the bits with value 1
.
Hope it helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With