What is the difference between bloom filters and hash sketches (also FM-sketches) and what is their use?
Hash sketches/Flajolet-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic counting algorithms for data base applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./Flajolet, P. (2003): Loglog Counting of Large Cardinalities, in: Springer LNCS 2832, Algorithms ESA 2003, pp. 605–617.
Hash sketches are used to count the number of distinct elements in a set.
given:
- a bit array B[] of length l
- a (single) hash function h() that maps to [0,1,...2^l)
- a function r() that gives the position of the least-significant 1-bit in the binary representation of its input (e.g. 000101 returns 1, 001000 returns 4)
insertion of element x:
- pn := h(x) returns a pseudo-random number
- apply r(pn) to get the position of the bit array to set to 1
since output of h() is pseudo-random every bit i is set to 1 ~n/(2^(i+1)) times
number of distinct elements in the set:
- find the position p of the right-most 0 in the bit array
- p = log2(n), solve for n to get the number of distinct element in the set;
the result might be up to 1.83 magnitudes off
usage:
- in Data Mining, P2P/distributed applications, estimation of the document frequency, etc.
Bloom filters
Bloom, H. (1970): Space/time trade-offs in hash coding with allowable errors, in: Communications of the ACM, Vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to test whether an element is a member of a set.
given:
- a bit array B[] of length m
- k different hash functions h_k() that map to [0,...,m-1], i.e. to one of the position of the m-bit array
insertion of element x:
- apply h_k to x (h_k(x)), for all k, i.e. you get k values
- set the resulting bits in the array B to 1 (if already set to 1, don't change anything)
check if y is already in the set:
- get the positions p_k to check using all the hash functions h_k (h_k(y)), i.e. for each function h_k you get a position p_k
- if one of the positions p_k is set to 0 in the array B, the element y is definitively not in the set
- if all positions given by p_k are 1, the element y might (!) be in the set
- false positive rate is approximately (1 - e^(-kn/m))^k, no false negatives are possible!
- by increasing the number of hashing functions, the false positive rate can be decreased; however, at the same time your bloom filter gets slower; the optimal value of k is k = (m/n)ln(2)
usage:
- in the beginning used as a cheap filter in databases to filter out elements that do not match a query
- various applications today, e.g. in Google BigTable, but also in networking for IP lookups, etc.
The Bloom Filter is a data structure used for Membership lookup while FM Sketch is primarily used for counting of elements. These two data structures provide the respective solutions optimizing over the space required to perform the lookup/computation and the trade off is the accuracy of the result.