
Cassandra distinct counting

I need to count a bunch of "things" in Cassandra, incrementing ~100-200 counters every few seconds or so.

However, I need to count distinct "things".

In order not to count something twice, I set a key in a CF, which the program reads before incrementing the counter, e.g. something like:

result = get cf[key];
if (result == NULL){
    set          cf[key][x] = 1;
    incr counter_cf[key][x];
}

However, this read operation slows down the cluster a lot. I tried to reduce the number of reads by using several columns, e.g. something like:

result = get cf[key];

if (result[key1] == NULL){
    set          cf[key1][x] = 1;
    incr counter_cf[key1][x];
}

if (result[key2] == NULL){
    set          cf[key2][x] = 1;
    incr counter_cf[key2][x];
}

//etc....

That reduced the reads from 200+ to about 5-6, but it still slows down the cluster.

I do not need exact counts, but I cannot use bit-masks or Bloom filters, because there will be well over 1M counters and some could grow past 4,000,000,000.

I am aware of HyperLogLog counting, but I do not see an easy way to use it with that many counters (1M+) either.

At the moment I am thinking of using Tokyo Cabinet as an external key/value store, but that solution, if it works, will not be as scalable as Cassandra.

Asked Oct 05 '22 by Nick

1 Answer

Using Cassandra for distinct counting is not ideal when the number of distinct values is large. Any time you need to do a read before a write, you should ask yourself whether Cassandra is the right choice.

If the number of distinct items is smaller, you can just store them as column keys and do a count. A count is not free; Cassandra still has to assemble the row to count the number of columns, but if the number of distinct values is in the order of thousands it's probably going to be OK. I assume you've already considered this option and that it's not feasible for you; I just thought I'd mention it.
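
For completeness, this is roughly what that could look like with CQL and the DataStax Python driver (an assumption on my part; the question uses Thrift-style CFs, and the keyspace, table and column names below are made up for the example):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_keyspace')

    # One partition per counter; every distinct "thing" becomes one clustering column.
    session.execute("""
        CREATE TABLE IF NOT EXISTS distinct_things (
            counter_key text,
            thing       text,
            PRIMARY KEY (counter_key, thing)
        )
    """)

    # Recording a thing is an idempotent write, so no read is needed to avoid
    # counting it twice.
    session.execute(
        "INSERT INTO distinct_things (counter_key, thing) VALUES (%s, %s)",
        ('page-42', 'user-123'))

    # The count walks the whole partition, so it only stays cheap while the number
    # of distinct things per counter is in the thousands.
    row = session.execute(
        "SELECT COUNT(*) FROM distinct_things WHERE counter_key = %s",
        ('page-42',)).one()
    print(row.count)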

The way people typically do it is to keep the HLLs or Bloom filters in memory and flush them to Cassandra periodically, i.e. not doing the actual operations in Cassandra, just using it for persistence. It's a more complex system, but there's no easy way of counting distinct values, especially if you have a massive number of counters.
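
To make the HLL route a bit more concrete, a minimal in-memory sketch could look like the following (a toy Python implementation for illustration only, not a production library; the register count and hash choice are arbitrary, and the small/large-range corrections are omitted). Each sketch serializes to a 4 KB blob, so one blob per counter can be written out on every flush:

    import hashlib

    class HyperLogLog:
        # Toy HyperLogLog; p=12 gives 4096 registers (roughly 1.6% standard error).
        def __init__(self, p=12):
            self.p = p
            self.m = 1 << p
            self.registers = bytearray(self.m)

        def add(self, value):
            # 64-bit hash of the value
            h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], 'big')
            idx = h >> (64 - self.p)                      # first p bits pick a register
            rest = h & ((1 << (64 - self.p)) - 1)         # remaining bits
            rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
            if rank > self.registers[idx]:
                self.registers[idx] = rank

        def count(self):
            # Standard HLL estimator (range corrections omitted for brevity)
            alpha = 0.7213 / (1 + 1.079 / self.m)
            return int(alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers))

        def serialize(self):
            # 4 KB blob, small enough to persist to Cassandra on every periodic flush
            return bytes(self.registers)

    hll = HyperLogLog()
    for user in ('user-1', 'user-2', 'user-1'):
        hll.add(user)
    print(hll.count())  # ~2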

Even if you switched to something else, for example something where you can do bit operations on values, you would still need to guard against race conditions. I suggest you simply bite the bullet and do all of your counting in memory. Shard the increment operations over your processing nodes by key and keep the whole counter state (both incremental and distinct) in memory on those nodes. Periodically flush the state to Cassandra and ack the increment operations when you do.

When a node gets an increment operation for a key it does not have in memory, it loads that state from Cassandra (or creates new state if there's nothing in the database). If a node crashes, its operations have not been acked and will be redelivered (you need a good message queue in front of the nodes to take care of this). Since you shard the increment operations, you can be sure that a counter's state is only ever touched by one node.
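
A skeleton of such a processing node might look like this (a Python sketch under the assumptions above; the persistence and ack functions are stand-ins for whatever store and message queue you actually use, and loading existing state from Cassandra for keys not yet in memory is left out):

    import time
    from collections import defaultdict

    def persist_to_cassandra(key, total, distinct_state):
        pass  # stand-in: write the running total and the serialized distinct-state blob

    def ack(message):
        pass  # stand-in: acknowledge the message on your queue

    class CounterShard:
        # One node owns a disjoint subset of keys, e.g. hash(key) % number_of_nodes,
        # so each counter's state is only ever touched by a single node.
        def __init__(self, flush_interval=10):
            self.totals = defaultdict(int)    # key -> incremental counter
            self.distinct = defaultdict(set)  # key -> items seen (or an HLL per key)
            self.unacked = []
            self.last_flush = time.time()
            self.flush_interval = flush_interval

        def handle(self, message):
            key, item = message['key'], message['item']
            if item not in self.distinct[key]:  # the distinct check is purely in memory
                self.distinct[key].add(item)
                self.totals[key] += 1
            self.unacked.append(message)
            if time.time() - self.last_flush >= self.flush_interval:
                self.flush()

        def flush(self):
            for key, total in self.totals.items():
                persist_to_cassandra(key, total, self.distinct[key])
            # Only ack once the state is safely persisted; if the node crashes before
            # this point, the queue redelivers the messages to the replacement node.
            for message in self.unacked:
                ack(message)
            self.unacked.clear()
            self.last_flush = time.time()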

Answered Oct 10 '22 by Theo