
Probabilistic hashing -- is there such a thing?

Say you want to implement a click tracker that counts a click on a link from any given IP address only once, but the number of links and clients is very large and you don't want to keep a table of every single IP/click pair. Suppose this has to run live against every click, so you can't afford a lookup against a big table each time.

Is there such a thing as "probabilistic hashing" or "lossy hashing": a way to test whether an IP is probably in a set, tolerating a certain error rate in order to save resources?

asked Jun 17 '09 by ʞɔıu


4 Answers

You could probably (ab?)use a bloom filter for something like this.
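For reference, a Bloom filter is only a few lines of code: k hash functions each set one bit per inserted item, and a membership test checks that all k bits are set (so it can return false positives but never false negatives). The sketch below is a generic illustration, not code from any particular library; the class name, bit count, and SHA-256-based hashing scheme are all assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not production code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _indexes(self, item):
        # Derive k independent-ish bit positions by salting one hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def might_contain(self, item):
        # True means "probably seen"; False is always definitive.
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indexes(item))
```

For the click-tracker use case, you would call `might_contain(ip)` before counting and `add(ip)` after; the only cost of a false positive is an occasional undercounted click.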

answered Sep 22 '22 by Hank Gay


Assuming IPv4 addresses, there is a search space of 2^32. You need no more than 1 bit per IP address (0 = no visit, 1 = visit). Without considering storage overhead, this would take 2^29 bytes (512 MB) to store. So a simplistic implementation would allocate a 512 MB array (or a table with 2^29 rows, each storing a byte, or 2^27 rows, each storing a 32-bit integer, or 2^26 rows, each storing a 64-bit integer, or...)
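The size arithmetic above can be checked directly (plain Python, just restating the quantities from the answer):

```python
bits = 2**32                  # one bit per possible IPv4 address
bytes_needed = bits // 8      # 2**29 bytes
megabytes = bytes_needed // 2**20
print(megabytes)              # 512
```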

You could optimize this for sparse population by turning it into a tree.

Define a "page" size of 2x bits. You will allocate storage for one page at a time.

Divide your search space (2^32) by your page size. The quotient is the total number of pages required to represent every possible address in your search space.

Then, to determine if there is a hit in your hash, you will first determine if the page is present, and if so, whether the appropriate bit in the page is set. To cache an address, you will first determine if the page is present; if not you will create it. Next you will set the appropriate bit.
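The page/bit arithmetic described above can be sketched in memory before involving a database. The dict-of-pages representation and the 1024-bit page size below are assumptions chosen to match the SQL example later in this answer:

```python
PAGE_BITS = 1024          # 2**10 bits per page, i.e. 128 bytes
PAGE_BYTES = PAGE_BITS // 8

pages = {}                # page_index -> bytearray, allocated lazily

def has_visited(ip):
    """Check the bit for a 32-bit IP; absent page means no visit."""
    page = pages.get(ip >> 10)          # high 22 bits select the page
    if page is None:
        return False
    offset = ip & 0x3FF                 # low 10 bits select the bit
    return bool(page[offset >> 3] & (1 << (offset & 7)))

def mark_visited(ip):
    """Set the bit for a 32-bit IP, creating the page if needed."""
    page = pages.setdefault(ip >> 10, bytearray(PAGE_BYTES))
    offset = ip & 0x3FF
    page[offset >> 3] |= 1 << (offset & 7)
```

Only pages that actually receive a visit are ever allocated, which is the point of the tree/paging scheme for a sparse population.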

This maps fairly easily to a database table. You need just two columns: a page index and a binary array. When you allocate a page, you simply store a row in the table with the correct page index and an empty (zeroed) binary array.

For instance, for a 1024-bit page size (yielding 2^22 maximum pages), you might structure your table like this:

CREATE TABLE VisitedIPs(
    PageIndex int         NOT NULL PRIMARY KEY,
    PageData  binary(128) NOT NULL
)

To check whether an IP has visited, you would use code similar to the following (pseudocode):

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

byte b = page[(ip & 0x3FF) >> 3];

bool hasVisited = (b & (1 << (ip & 7))) != 0;

Recording that an IP has visited is similar:

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

page[(ip & 0x3FF) >> 3] |= (1 << (ip & 7));

sql =
    "UPDATE VisitedIPs " +
    "SET PageData = @pageData " +
    "WHERE PageIndex = " + (ip >> 10);

ExecSQL(sql, new SqlParam("@pageData", page));

answered Sep 20 '22 by P Daddy


All hashing is lossy, by virtue of the pigeonhole principle: inevitably, you are trying to cram N things into M slots (where N >> M). All you need to do is not handle collision cases at all and pick a sufficiently large hash table that the error rate is acceptable.

answered Sep 20 '22 by John Feminella


Sure! Decide how many "bins" you can afford (at one bit each), say N; hash the IP address to a string of bits B; take B modulo N. You can compute the probability of accidental collisions (under some approximation, e.g. assuming all hashed IP addresses form equally likely bitstrings) and then choose N to satisfy whatever maximum collision probability is acceptable for your application.
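One rough way to size N is the standard birthday-style approximation. This is a sketch under the uniform-hashing assumption Alex describes; the function names are made up for illustration:

```python
import math

def per_item_false_positive(n_bins, n_items):
    """Approximate chance that a previously unseen address lands in a
    bin already set by one of n_items earlier addresses, assuming the
    hash distributes addresses uniformly over n_bins bins."""
    return 1 - math.exp(-n_items / n_bins)

def any_collision(n_bins, n_items):
    """Approximate probability that at least two of n_items distinct
    addresses collide (birthday approximation)."""
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * n_bins))
```

For example, with N = 2^30 bins (128 MB of bits) and about a million tracked addresses, the per-click false-positive rate stays below 0.1%.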

answered Sep 21 '22 by Alex Martelli