 

Cassandra: Batch write optimisation

I get a bulk write request for, say, some 20 keys from a client. I can either write them to C* in one batch, or write them individually in an async way and wait on the futures for them to complete.

Writing in a batch does not seem to be a good option, as per the documentation, because my insertion rate will be high and, if the keys belong to different partitions, the coordinators will have to do extra work.

Is there a way in the DataStax Java driver to group keys that belong to the same partition, club them into small batches, and then do individual unlogged batch writes asynchronously? That way I make fewer RPC calls to the server, and at the same time the coordinator will write locally. I will be using a token-aware policy.

asked Aug 13 '16 10:08 by Peter

People also ask

How does Cassandra batch work?

In Cassandra, batch allows the client to group related updates into a single statement. If some of the replicas for the batch fail mid-operation, the coordinator will hint those rows automatically.

What is batch statement in Cassandra?

The batch statement combines multiple data modification language statements (such as INSERT, UPDATE, and DELETE) to achieve atomicity and isolation when targeting a single partition or only atomicity when targeting multiple partitions.

What is atomic batch operation?

An atomic transaction is an indivisible and irreducible series of operations such that either all occur, or nothing occurs. Single partition batch operations are atomic automatically, while multiple partition batch operations require the use of a batchlog to ensure atomicity.

What is the main use case for single partition batches?

Single-partition batches should be used when atomicity and isolation are required. Even if you only need atomicity (and no isolation), you should model your data so that you can use single-partition batches instead of multi-partition batches.


1 Answer

Your idea is right, but there is no built-in way; you usually do this manually.

The main rule here is to use TokenAwarePolicy, so that some coordination happens on the driver side. Then you can group your requests by equality of partition key; that will probably be enough, depending on your workload.

What I mean by "grouping by equality of partition key" is the following: say you have some data that looks like

MyData { partitioningKey, clusteringKey, otherValue, andAnotherOne }

Then, when inserting several such objects, you group them by MyData.partitioningKey. That is, for every existing partitioningKey value, you take all objects with the same partitioningKey and wrap them in a BatchStatement. Now you have several BatchStatements, so just execute them.
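The grouping step above can be sketched in plain Java. `MyData` and its field names are hypothetical stand-ins from the example above; in real code each resulting group would be wrapped in an unlogged `BatchStatement` and executed with the driver's `executeAsync`, which is only referenced in comments here so the sketch stays self-contained.

```java
import java.util.*;
import java.util.stream.*;

public class PartitionGrouper {

    // Minimal stand-in for the MyData shape from the answer.
    public static final class MyData {
        final String partitioningKey;
        final String clusteringKey;

        public MyData(String partitioningKey, String clusteringKey) {
            this.partitioningKey = partitioningKey;
            this.clusteringKey = clusteringKey;
        }
    }

    // Group rows by partition key. Each resulting list is one candidate
    // single-partition batch; in real code you would build one unlogged
    // BatchStatement per list and call session.executeAsync(batch) on it.
    public static Map<String, List<MyData>> groupByPartition(List<MyData> rows) {
        return rows.stream()
                   .collect(Collectors.groupingBy(d -> d.partitioningKey));
    }

    public static void main(String[] args) {
        List<MyData> rows = Arrays.asList(
            new MyData("p1", "a"),
            new MyData("p2", "b"),
            new MyData("p1", "c"));
        Map<String, List<MyData>> groups = groupByPartition(rows);
        System.out.println(groups.size() + " batches to execute");
    }
}
```

With a token-aware policy, each of these single-partition batches should be routed to a replica for that partition, which is what makes the grouping pay off.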

If you wish to go further and mimic Cassandra's hashing, you should look at the cluster metadata via the getMetadata method of com.datastax.driver.core.Cluster; it has a getTokenRanges method, whose result you can compare against Murmur3Partitioner.getToken (or whichever partitioner you configured in cassandra.yaml). I've never tried that myself, though.

So I would recommend implementing the first approach and then benchmarking your application. I'm using that approach myself, and on my workload it works far better than no batches at all, let alone batches without grouping.
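The "execute each batch asynchronously and wait on the futures" part of the question can also be sketched without the driver. Here `CompletableFuture` is a stand-in for the driver's `ResultSetFuture`; the `writeBatch` function is a hypothetical placeholder for the real `session.executeAsync(batchStatement)` call.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.*;

public class AsyncBatchPattern {

    // Fire one async write per batch, then wait for all of them.
    // `writeBatch` is a placeholder for session.executeAsync(batch);
    // here it just returns the number of rows the batch would write.
    public static int writeAll(List<List<String>> batches,
                               Function<List<String>, CompletableFuture<Integer>> writeBatch) {
        List<CompletableFuture<Integer>> futures = new ArrayList<>();
        for (List<String> batch : batches) {
            futures.add(writeBatch.apply(batch));   // one RPC per partition group
        }
        // Block until every batch completes, summing rows written.
        return futures.stream().mapToInt(f -> f.join()).sum();
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("a", "b"),
            Arrays.asList("c"));
        int written = writeAll(batches,
            batch -> CompletableFuture.supplyAsync(batch::size));
        System.out.println(written + " rows written");
    }
}
```

The point of the pattern is that the batches for different partitions are in flight concurrently, so grouping does not cost you the parallelism of individual async writes.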

answered Sep 23 '22 01:09 by folex