Bloomfilter and Cassandra = Why used and why hashed several times?

Tags:

I Read this: http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html

My Questions:

1.) Is it correct, that Cassandra only uses the bloom filter, to find out the SST (Sorted String Table) which most likely contains the key? As there might be several SSTs and Cassandra does not know in Which SST a key might be? So to speed this up looking in all SSTs bloomfilters are used. Is this correct? (I am trying to understand how cassandra works...)

2.) Why are (as explained in the link above) keys hashed several times? Is it correct that the keys need to be hashed with different Hash functions several times, to get a better "random distribution of the" Bits? If this is wrong, why does a key need to be hashed several times? This will cost CPU cycles? If I have the output of several Hash functions, what is done with the results, are they ANDed or XORded. Does this make any difference?

3.)Using MD5 how big is the difference of "Fales positives by using the Bloomfilter" compared to SHA1 (which according to the articles is random distributed)? Why is MD5 not random distributed?

Thanks very much!! Jens

790

asked May 01 '11 15:05

jens

2 Answers

1) Yes, see this in the cassandra wiki,

Cassandra uses bloom filters to save IO when performing a key lookup: each SSTable has a bloom filter associated with it that Cassandra checks before doing any disk seeks, making queries for keys that don't exist almost free

The columns of a key may be spread out in several sstables. If it wasn't for bloom filters, every read of a key would have to read every sstable, which is prohibitively expensive. By using bloom filters, cassandra almost always only has to look in the sstables which contain data for that key.

2) This might give you a better understanding of bloom filters. You hash k times to give independent positions in an array of size m. For example, if A and B are the elements in the set, and you have k = 2, your hash functions are md5 and sha1, and m = 16, you can do

md5(A) % m = 7
sha1(A) % m = 12

md5(B)  % m = 15
sha1(B)  % m = 12

This gives you m[7], m[12] and m[15] are true for the filter.

To see if C is in the set, you do

md5(C)  % m = 8
sha1(C) % m = 12

Since m[8] is false, you know C is not in the set, however, for D

md5(D)  % m = 7
sha1(D)  % m = 15

Both m[7] and m[15] is true, but D is not in the set, so D is a false positive.

This does cost cpu cycles, but you are trading cpu cycles for reduced io, which makes sense for cassandra.

3) The article doesn't mention md5. md5 is randomly distributed, and I would guess the difference between md5 and sha-1 for bloom filters is not large.

177

answered Sep 20 '22 12:09

sbridges

As an addition to the 3rd point of the answer by sbridges.

MD5 and SHA-1 are randomly distributed but are cryptographic hash functions. While implementing any type of bloom filter, the only bottleneck in the performance is time taken for hashing. This is why, cryptographic functions when used decrease the performance of the application.

It is recommended to use non-cryptographic hash functions like Murmur hash. This paper, recommends to construct and hash function like:

g(x) = h1(x) + i * h2(x)

where g(x) is the new hash function, h1 and h2 are standard hash functions and i is the number of iteration ranging from 0 to k.

By using this technique, the same performance can be reached with two hash functions (assuming k > 2).

answered Sep 19 '22 12:09

Neeraj Sharma

Related questions
                            
                                Improving the distribution of hash function values
                            
                                ObjectDisposedException When using MD5 ComputeHash
                            
                                Combine MD5 hashes of multiple files
                            
                                Hashing an array or object in python 3
                            
                                Merge a hash with the key/values of a string in ruby
                            
                                How safe is it to rely on hashes for file identification?
                            
                                When does it make sense to presize a hash?
                            
                                How does symstore calculate the directory hash value
                            
                                Is there a default hash function for an unordered_set of a custom class?
                            
                                Getting ruby hash values by an array of keys
                            
                                Is it possible to hash pointers in portable C++03 code?
                            
                                Can anybody explain how java design HashMap's hash() function? [duplicate]
                            
                                Can I Expose The Hashed Salt + Password?
                            
                                PHP password_verify() and slow equals comparison
                            
                                How Do I Pull a Single Commit In Git?
                            
                                Is it possible to get hash values as compile-time constants?
                            
                                firefox cache hash key generation algorithm bug
                            
                                Initialize Java Generic Array of Type Generic
                            
                                Can someone point me to a good PHP/MySQL salted hashed password implementation?
                            
                                Reversing a hash function

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Bloomfilter and Cassandra = Why used and why hashed several times?

Tags:

hash

cassandra

bloom-filter

jens

People also ask

2 Answers

sbridges

Neeraj Sharma

Recent Activity

Donate For Us