
Probabilistic hashing -- is there such a thing?

Say you want to implement a click tracker that counts a click on a link from any given IP address only once, but the number of links and clients is very large and you don't want to keep a table of every single IP/click pair. Suppose this has to run live against every click, so you can't afford a lookup against a big table each time.

Is there such a thing as "probabilistic hashing" or "lossy hashing": a way to test whether an IP is probably in a set, tolerating a certain error rate in order to save resources?

asked Jun 17 '09 by ʞɔıu


4 Answers

You could probably (ab?)use a bloom filter for something like this.
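For reference, a Bloom filter is only a few lines of code: k hash functions each set one bit per inserted item, and a membership test checks that all k bits are set (so it can return false positives but never false negatives). The sketch below is a generic illustration, not code from any particular library; the class name, bit count, and SHA-256-based hashing scheme are all assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not production code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _indexes(self, item):
        # Derive k independent-ish bit positions by salting one hash.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx >> 3] |= 1 << (idx & 7)

    def might_contain(self, item):
        # True means "probably seen"; False is always definitive.
        return all(self.bits[idx >> 3] & (1 << (idx & 7))
                   for idx in self._indexes(item))
```

For the click-tracker use case, you would call `might_contain(ip)` before counting and `add(ip)` after; the only cost of a false positive is an occasional undercounted click.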

answered Sep 22 '22 by Hank Gay


Assuming IPv4 addresses, there is a search space of 2^32. You need no more than 1 bit per IP address (0 = no visit, 1 = visit). Without considering storage overhead, this would take 2^29 bytes (512 MB) to store. So a simplistic implementation would allocate a 512 MB array (or a table with 2^29 rows, each storing a byte, or 2^27 rows, each storing a 32-bit integer, or 2^26 rows, each storing a 64-bit integer, or...)
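The size arithmetic above can be checked directly (plain Python, just restating the quantities from the answer):

```python
bits = 2**32                  # one bit per possible IPv4 address
bytes_needed = bits // 8      # 2**29 bytes
megabytes = bytes_needed // 2**20
print(megabytes)              # 512
```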

You could optimize this for sparse population by turning it into a tree.

Define a "page" size of 2x bits. You will allocate storage for one page at a time.

Divide your search space (2^32) by your page size. The quotient is the total number of pages required to represent every possible address in your search space.

Then, to determine if there is a hit in your hash, you will first determine if the page is present, and if so, whether the appropriate bit in the page is set. To cache an address, you will first determine if the page is present; if not you will create it. Next you will set the appropriate bit.
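The page/bit arithmetic described above can be sketched in memory before involving a database. The dict-of-pages representation and the 1024-bit page size below are assumptions chosen to match the SQL example later in this answer:

```python
PAGE_BITS = 1024          # 2**10 bits per page, i.e. 128 bytes
PAGE_BYTES = PAGE_BITS // 8

pages = {}                # page_index -> bytearray, allocated lazily

def has_visited(ip):
    """Check the bit for a 32-bit IP; absent page means no visit."""
    page = pages.get(ip >> 10)          # high 22 bits select the page
    if page is None:
        return False
    offset = ip & 0x3FF                 # low 10 bits select the bit
    return bool(page[offset >> 3] & (1 << (offset & 7)))

def mark_visited(ip):
    """Set the bit for a 32-bit IP, creating the page if needed."""
    page = pages.setdefault(ip >> 10, bytearray(PAGE_BYTES))
    offset = ip & 0x3FF
    page[offset >> 3] |= 1 << (offset & 7)
```

Only pages that actually receive a visit are ever allocated, which is the point of the tree/paging scheme for a sparse population.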

This maps fairly easily to a database table. You need just two columns: a page index and a binary array. When you allocate a page, you simply store a row in the table with the correct page index and an empty (zeroed) binary array.

For instance, for a 1024-bit page size (yielding 2^22 maximum pages), you might structure your table like this:

CREATE TABLE VisitedIPs(
    PageIndex int         NOT NULL PRIMARY KEY,
    PageData  binary(128) NOT NULL
)

To check whether an IP has visited, you would use code similar to the following (pseudocode):

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

byte b = page[(ip & 0x3FF) >> 3];

bool hasVisited = (b & (1 << (ip & 7))) != 0;

Recording that an IP has visited is similar:

uint ip = address.To32Bit();

string sql =
    "SELECT PageData " +
    "FROM VisitedIPs " +
    "WHERE PageIndex = " + (ip >> 10);

byte[] page = (byte[])GetFromDB(sql);

page[(ip & 0x3FF) >> 3] |= (1 << (ip & 7));

sql =
    "UPDATE VisitedIPs " +
    "SET PageData = @pageData " +
    "WHERE PageIndex = " + (ip >> 10);

ExecSQL(sql, new SqlParam("@pageData", page));

answered Sep 20 '22 by P Daddy


All hashing is lossy, by virtue of the pigeonhole principle: inevitably, you are trying to cram N things into M slots (where N >> M). All you need to do is not handle collision cases at all and pick a sufficiently large hash table that the error rate is acceptable.

answered Sep 20 '22 by John Feminella


Sure! Decide how many "bins" you can afford (at one bit each), say N; hash the IP address to a string of bits B; take B modulo N. You can compute the probability of accidental collisions (under some approximation, e.g. assuming all hashed IP addresses form equally likely bitstrings) and then choose N to satisfy whatever maximum collision probability is acceptable for your application.
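One rough way to size N is the standard birthday-style approximation. This is a sketch under the uniform-hashing assumption Alex describes; the function names are made up for illustration:

```python
import math

def per_item_false_positive(n_bins, n_items):
    """Approximate chance that a previously unseen address lands in a
    bin already set by one of n_items earlier addresses, assuming the
    hash distributes addresses uniformly over n_bins bins."""
    return 1 - math.exp(-n_items / n_bins)

def any_collision(n_bins, n_items):
    """Approximate probability that at least two of n_items distinct
    addresses collide (birthday approximation)."""
    return 1 - math.exp(-n_items * (n_items - 1) / (2 * n_bins))
```

For example, with N = 2^30 bins (128 MB of bits) and about a million tracked addresses, the per-click false-positive rate stays below 0.1%.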

answered Sep 21 '22 by Alex Martelli