Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a good algorithm to check whether or not a number exist in multiple sets without searching them all?

Scenario

Let's say you have multiple databases in 3 zones. Zone A, B, and C. Each zone in different geographical location. At the same time, you have an application that will route username and password based on the geographical location of the user. For example, user A will be redirected to database in Zone A. User B Zone B and so on.

Now, let's say user A moves to a zone B. The application query zone B and won't find anything. Querying zone A and zone C might take some time due to zones are far away, and will have to query all the databases in all zones.

My Question

How can you verify if a string/number exists in multiple sets?

or

How can you verify a row exist in the database before even sending a query?

My Algorithm

This is not perfect, but will give you some idea what I'm trying to do

If we have the database with the following 3 users

  • foo
  • bar
  • foobar

We take the hash of all 3 users, and look for the next prime number if the hash is not prime.

sum = hash(foo).nextPrime() * hash(bar).nextPrime() * hash(foobar).nextPrime()

That sum is shared between all zones. If I want to check foo, I can just take the hash of foo, and look for the next prime, then take the gcd(foo,sum). If it's not equal to one. It means foo exist in some database. If it equal to one, it means foo doesn't exist at all. If I want to add a new username. I can simply do sum = sum * hash(newUserName).nextPrime().

Sum will grow to a point that will be just faster to query all databases.

Do you know a similar algorithm to solve this problem?

like image 363
Ahmed Avatar asked May 11 '15 23:05

Ahmed


1 Answers

One data structure suitable for this application is a Bloom filter.

A Bloom filter is a probabilistic data structure which allows you test whether an item is already in a set. If the test returning false then the item is definitely not in the set (0% false negatives), if true then it may be in the set, but is not guaranteed to be (false positives are possible).

The filter is implemented as a bit array with m bits and a set of k hash functions. To add an item to the array (e.g. a username), hash the item using each of the hash functions and then take the modulo m of each hash value to compute the indexes to set in the bit array. To test if an item is in the set, compute all the hashes and indexes and check that all of the corresponding bits in the array are set to 1. If any of them are zero them the item is definitely not in the set, if all are 1 then the item is most likely in the set, but there is a small chance it may not be, the percentage of false positives can be reduced by using a larger m.

To implement the k hash functions, it is possible to just use the same hashing algorithm (e.g. CRC32, MD5 etc) but append different salts to the username string for each before passing to the hash function, effectively creating "new" hash functions for each salt. For a given m and n (number of elements being added), the optimal number of hash functions is k = (m / n) ln 2

For your application, the Bloom filter bit array would be shared across all zones A B C etc. When a user attempts to login, you could first check in the database of the local zone, and if present then log them in as normal. If not present in the local database, check the Bloom filter and if the result is negative then you know for sure that they don't exist in another zone. If positive, then you still need to check the databases in the other zones (because of the possibility of a false positive), but presumably this isn't a big issue because you would be contacting the other zones in any case to transfer the user's data in the case that it was a true positive.

One down-side of using a Bloom filter is that it is difficult (though not impossible) to remove elements from the set once they have been added.

like image 179
samgak Avatar answered Oct 11 '22 15:10

samgak