I'd like to use Cassandra to store a counter, for example how many times a given page has been viewed. The counter will never decrement. The value of the counter does not need to be exact, but it should be accurate over time.
My first thought was to store the value as a column and just read the current count, increment it by one and then put it back in. However, if another operation is also trying to increment the counter at the same time, I think the final value would just be whichever write has the latest timestamp, so one of the increments would be lost.
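To make the lost-update concern concrete, here is a minimal sketch of two clients interleaving a read-increment-write cycle. LastWriteWinsStore is a hypothetical in-memory stand-in, not a Cassandra API; it only assumes that conflicting writes are resolved by timestamp.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory stand-in for a store that resolves conflicts by timestamp. */
final class LastWriteWinsStore {
    private static final class Cell {
        final long value;
        final long timestamp;

        Cell(long value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, Cell> cells = new HashMap<>();

    /** Returns the stored value, or 0 if the key has never been written. */
    long read(String key) {
        Cell cell = cells.get(key);
        return cell == null ? 0L : cell.value;
    }

    /** Keeps the write only if its timestamp is not older than the stored one. */
    void write(String key, long value, long timestamp) {
        Cell current = cells.get(key);
        if (current == null || timestamp >= current.timestamp) {
            cells.put(key, new Cell(value, timestamp));
        }
    }

    /** Interleaves two read-increment-write cycles and returns the final value. */
    static long simulateTwoClients() {
        LastWriteWinsStore store = new LastWriteWinsStore();
        store.write("page", 5, 1);

        long seenByA = store.read("page"); // client A reads 5
        long seenByB = store.read("page"); // client B also reads 5

        store.write("page", seenByA + 1, 2); // A writes 6
        store.write("page", seenByB + 1, 3); // B writes 6 with a later timestamp

        return store.read("page"); // two increments happened, only one survives
    }
}
```

Both clients add one to the same stale value, so the later timestamp simply overwrites the earlier write and the count ends up one short.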
Another thought would be to store each page load as a new column in a CF. Then I could just run get_count() on that key and get the number of columns. Reading through the documentation, it appears that it is not a very efficient operation at all.
Am I approaching the problem incorrectly?
Counters were added in Cassandra 0.8.
Use the incr method to increment the value of a column by 1.
[default@app] incr counterCF [ascii('a')][ascii('x')];
Value incremented.
[default@app] incr counterCF [ascii('a')][ascii('x')];
Value incremented.
Described here: http://www.jointhegrid.com/highperfcassandra/?p=79
Or it can be done programmatically:
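// c is presumably an open Thrift Cassandra.Client; r is a request with a date and a url.
// bucketByMinute and bucketByDay are date formatters defined elsewhere.
CounterColumn counter = new CounterColumn();
ColumnParent cp = new ColumnParent("page_counts_by_minute");
counter.setName(ByteBufferUtil.bytes(bucketByMinute.format(r.date)));
counter.setValue(1); // the amount to add, not the absolute value
// Row key is "<day>-<url>"; the column name is the minute bucket.
c.add(ByteBufferUtil.bytes(bucketByDay.format(r.date) + "-" + r.url),
      cp, counter, ConsistencyLevel.ONE);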
Described here: http://www.jointhegrid.com/highperfcassandra/?cat=7
[Update] Looks like counter support will be ready for primetime in 0.8!
I definitely wouldn't use get_count, as that is an O(n) operation which is run every time you read the "counter." Worse than it being just O(n), it may span multiple nodes, which would introduce network latency. And finally, why tie up all that disk space when all you care about is a single number?
For right now, I wouldn't use Cassandra for counters at all. They are working on this functionality, but it's not ready for prime time yet.
https://issues.apache.org/jira/browse/CASSANDRA-1072
You've got a few options in the meantime.
1) (Bad) Store your count in a single record and have one and only one thread of your application be responsible for counter management.
2) (Better) Split the counter into n shards, and have n threads manage each shard as a separate counter. You can randomize which thread is used by your app each time for stateless load balancing across these threads. Just make sure that each thread is responsible for exactly one shard.
3a) (Best) Use a separate tool that is either transactional (such as an RDBMS) or that supports atomic increment operations (memcached, redis).
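Option 2 can be sketched roughly as follows. This is an in-memory illustration only: ShardedCounter and its methods are hypothetical names, and the AtomicLongArray stands in for n separate counter records in the store. Each shard is owned by exactly one single-threaded executor, callers pick a shard at random, and a read sums all shards.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLongArray;

/** Hypothetical sharded counter: one writer thread per shard, reads sum the shards. */
final class ShardedCounter {
    // Stand-in for n separate counter records; each slot has a single writer.
    private final AtomicLongArray shards;
    // One single-threaded executor per shard, so a shard is never written concurrently.
    private final ExecutorService[] owners;

    ShardedCounter(int shardCount) {
        shards = new AtomicLongArray(shardCount);
        owners = new ExecutorService[shardCount];
        for (int i = 0; i < shardCount; i++) {
            owners[i] = Executors.newSingleThreadExecutor();
        }
    }

    /** Picks a random shard and asks its owner thread to do the read-increment-write. */
    void increment() {
        final int shard = ThreadLocalRandom.current().nextInt(owners.length);
        owners[shard].execute(() -> shards.set(shard, shards.get(shard) + 1));
    }

    /** Sums every shard; may lag slightly behind increments still queued. */
    long total() {
        long sum = 0;
        for (int i = 0; i < shards.length(); i++) {
            sum += shards.get(i);
        }
        return sum;
    }

    /** Stops accepting increments and waits for queued ones to be applied. */
    void shutdownAndWait() {
        for (ExecutorService owner : owners) {
            owner.shutdown();
        }
        try {
            for (ExecutorService owner : owners) {
                owner.awaitTermination(10, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because each shard has a single writer, the plain read-then-write inside the owner thread cannot lose updates, which is the property the "exactly one shard per thread" rule is meant to guarantee.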
[Update.2] I would avoid using a distributed lock (see memcached and zookeeper mutexes), as such a lock tolerates node failure and network partitioning very poorly if improperly implemented.
What I ended up doing was using get_count() and caching the result in a caching ColumnFamily.
This way I could get a general guess at the count but still get the exact count whenever I wanted.
Additionally, I was able to adjust how stale a value I was willing to accept on a per-request basis.
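A minimal sketch of that idea, under some assumptions: the answer keeps the cached value in a caching ColumnFamily, whereas this illustration holds it in memory, and CachedCount, exactCount and maxAgeMillis are hypothetical names. The expensive count (for example a get_count() call) is passed in as a supplier, and each request states how old a cached value it will accept.

```java
import java.util.function.LongSupplier;

/** Hypothetical cache in front of an expensive count, with per-request staleness. */
final class CachedCount {
    private final LongSupplier exactCount;  // the expensive call, e.g. get_count()
    private final LongSupplier clockMillis; // injectable clock, e.g. System::currentTimeMillis

    private long cachedValue;
    private long cachedAtMillis;
    private boolean populated;

    CachedCount(LongSupplier exactCount, LongSupplier clockMillis) {
        this.exactCount = exactCount;
        this.clockMillis = clockMillis;
    }

    /**
     * Returns the cached value if it is younger than maxAgeMillis,
     * otherwise recomputes it. Passing 0 always forces an exact count.
     */
    synchronized long get(long maxAgeMillis) {
        long now = clockMillis.getAsLong();
        if (!populated || now - cachedAtMillis >= maxAgeMillis) {
            cachedValue = exactCount.getAsLong();
            cachedAtMillis = now;
            populated = true;
        }
        return cachedValue;
    }
}
```

A page that only needs a rough figure might call get(60_000), while a report that needs the exact number calls get(0) and pays for the full count.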
We are going to address a similar problem by keeping the current value of a counter in a distributed cache (for example, memcached). When the counter is updated, we will store its value in Cassandra. Therefore, even if some cache node fails, we will be able to get the value from the database.
This solution is not perfect. However, data such as a visit counter is not very sensitive, so minor inconsistencies are acceptable in my opinion.
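The approach could look roughly like this. It is a sketch under stated assumptions, not a real client: WriteThroughCounter is a hypothetical name, and the two in-memory maps stand in for the distributed cache and for Cassandra respectively.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical counter kept in a cache and copied to a durable store on each update. */
final class WriteThroughCounter {
    // Stand-in for the distributed cache (e.g. memcached), which can lose entries.
    private final ConcurrentHashMap<String, AtomicLong> cache = new ConcurrentHashMap<>();
    // Stand-in for the durable copy kept in Cassandra.
    private final ConcurrentHashMap<String, Long> durable = new ConcurrentHashMap<>();

    /** Increments in the cache, then stores the new value durably. */
    long increment(String key) {
        // On a cache miss, reload the last value that reached the durable store.
        AtomicLong cached = cache.computeIfAbsent(
                key, k -> new AtomicLong(durable.getOrDefault(k, 0L)));
        long updated = cached.incrementAndGet();
        // Keep the larger value so a late write never moves the counter backwards.
        durable.merge(key, updated, Long::max);
        return updated;
    }

    /** Reads from the cache, falling back to the durable store on a miss. */
    long read(String key) {
        AtomicLong cached = cache.get(key);
        return cached != null ? cached.get() : durable.getOrDefault(key, 0L);
    }

    /** Simulates a cache node failure by dropping every cached entry. */
    void simulateCacheLoss() {
        cache.clear();
    }
}
```

In a real deployment the cache and the database are separate systems, so an increment can reach the cache and fail before reaching Cassandra; that gap is the source of the minor inconsistencies mentioned above.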