
How to understand bloom_filter_fp_chance and read_repair_chance in Cassandra

Tags:

cassandra

Bloom Filters

When data is requested, the Bloom filter checks if the row exists before doing disk I/O. 

Read Repair

Read repair performs a digest query on all replicas for that key.

My confusion is how to set these values between 0 and 1. What happens when the value varies?

Thanks in advance.

Jagadeesh asked Aug 03 '15 10:08




1 Answer

bloom_filter_fp_chance and read_repair_chance control two different things. You would usually leave both at their default values, which work well for most use cases.

bloom_filter_fp_chance sets the target false-positive probability of the Bloom filter that Cassandra builds for each SSTable stored on disk. The Bloom filters are kept in memory, and when you do a read, Cassandra checks them to see which SSTables might have data for the key you are reading. A Bloom filter can give a false positive: it reports that the key may be present, but when Cassandra actually reads the SSTable the key is not there, so the read was a waste of time. It never gives a false negative. The lower the false-positive chance you configure, the fewer wasted SSTable reads you get, but the more memory the filter needs.

From the documentation:

0 Enables the unmodified, effectively the largest possible, Bloom filter
1.0 Disables the Bloom Filter
The recommended setting is 0.1. A higher value yields diminishing returns.

So a higher number gives a higher chance of a false positive (fp) when reading the bloom filter.
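To get a feel for the memory side of that trade-off, here is a minimal sketch using the textbook Bloom filter sizing formulas (bits per key = -ln(p) / (ln 2)^2, hash count = bits per key * ln 2). The function name `bloom_sizing` is made up for this illustration, and Cassandra's own sizing code may round or cap things differently, so treat the numbers as rough estimates only.

```python
import math


def bloom_sizing(fp_chance: float) -> tuple[float, int]:
    """Return (bits per key, hash function count) for a textbook Bloom filter.

    Hypothetical helper for illustration; not Cassandra's implementation.
    """
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    # Optimal number of bits per stored key for the target false-positive rate.
    bits_per_key = -math.log(fp_chance) / (math.log(2) ** 2)
    # Optimal number of hash functions for that many bits per key.
    hashes = max(1, round(bits_per_key * math.log(2)))
    return bits_per_key, hashes


for p in (0.1, 0.01, 0.001):
    bits, k = bloom_sizing(p)
    print(f"fp_chance={p}: ~{bits:.1f} bits per key, {k} hash functions")
```

Under these formulas, going from 0.1 to 0.01 roughly doubles the memory per key (about 4.8 bits to about 9.6 bits), which is why lowering the value further costs more and more memory for fewer and fewer avoided reads.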

read_repair_chance controls the probability that a read of a key will be checked against the other replicas for that key. This is useful if nodes in your cluster are frequently down, which lets replicas get out of sync. If you do a lot of reads, read repair will slowly bring the data back into sync as you read, without you having to run a full repair on the nodes. Higher settings cause more background read repairs and consume more resources, but sync the data more quickly as you do reads.
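As a rough illustration of what "probability" means here, the sketch below models each read as an independent coin flip that triggers a background read repair with probability read_repair_chance. This is a simplified model written for this answer, not Cassandra's code, and `count_read_repairs` is a hypothetical helper name.

```python
import random


def count_read_repairs(num_reads: int, read_repair_chance: float, seed: int = 0) -> int:
    """Count how many of num_reads would trigger a read repair in a coin-flip model."""
    if not 0.0 <= read_repair_chance <= 1.0:
        raise ValueError("read_repair_chance must be between 0 and 1")
    rng = random.Random(seed)  # seeded so repeated runs give the same count
    # rng.random() is in [0, 1), so a chance of 0 never fires and 1 always fires.
    return sum(1 for _ in range(num_reads) if rng.random() < read_repair_chance)


for chance in (0.0, 0.1, 0.5, 1.0):
    repairs = count_read_repairs(10_000, chance)
    print(f"read_repair_chance={chance}: {repairs} of 10000 reads trigger a repair")
```

With a value of 0.1, about one read in ten pays the extra cost of contacting the other replicas, so the setting trades steady background work against how quickly reads alone bring replicas back into sync.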

See documentation on these settings here.

Jim Meyer answered Sep 28 '22 18:09