 

What is the batch limit in Cassandra?

I have a Java client that pushes (INSERTs) records to a Cassandra cluster in batches. The elements in a batch all have the same row key, so they will all be placed on the same node. I also don't need the transaction to be atomic, so I've been using unlogged batches.

The number of INSERT commands in each batch depends on different factors, but can be anywhere between 5 and 50000. At first I just put as many commands as I had into one batch and submitted it. This threw com.datastax.driver.core.exceptions.InvalidQueryException: Batch too large. Then I used a cap of 1000 INSERTs per batch, and then lowered it to 300. I noticed I'm just guessing randomly without knowing exactly where this limit comes from, which can cause trouble down the road.

My question is: what is this limit? Can I modify it? How can I know how many elements can be placed in a batch, i.e. when is my batch "full"?

asked Jan 09 '16 by m.hashemian

People also ask

What is a batch statement in Cassandra?

The batch statement combines multiple data modification language statements (such as INSERT, UPDATE, and DELETE) to achieve atomicity and isolation when targeting a single partition or only atomicity when targeting multiple partitions.
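
As a rough sketch of what such a single-partition batch looks like when issued from the DataStax Java driver mentioned in the question (the keyspace ks, table users, its columns, and the already-connected session object are all invented for the example):

// All three DML statements target the same partition (user_id 'u1'), so the
// batch gets both atomicity and isolation as described above.
session.execute(
    "BEGIN BATCH "
  + "INSERT INTO ks.users (user_id, name) VALUES ('u1', 'Alice'); "
  + "UPDATE ks.users SET email = 'alice@example.com' WHERE user_id = 'u1'; "
  + "DELETE phone FROM ks.users WHERE user_id = 'u1'; "
  + "APPLY BATCH");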

How do you do a bulk insert in Cassandra?

There is a batch insert operation in Cassandra. You can batch together inserts, even in different column families, to make insertion more efficient. In Hector, you can use HFactory.createMutator, then use the add methods on the returned Mutator to add operations to your batch.
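
For illustration only, a minimal sketch of that Hector pattern; the cluster name, contact point, keyspace, column family and values below are all made up, so adapt them to your own setup:

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class HectorBatchInsert {
    public static void main(String[] args) {
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "127.0.0.1:9160");
        Keyspace keyspace = HFactory.createKeyspace("mykeyspace", cluster);

        Mutator<String> mutator = HFactory.createMutator(keyspace, StringSerializer.get());
        // Queue several insertions for the same row key (they could also span column families)...
        mutator.addInsertion("row1", "MyColumnFamily", HFactory.createStringColumn("col1", "v1"));
        mutator.addInsertion("row1", "MyColumnFamily", HFactory.createStringColumn("col2", "v2"));
        // ...then send everything to the cluster as one batched mutation.
        mutator.execute();
    }
}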


3 Answers

I would recommend not increasing the cap, and just splitting the inserts into multiple requests. Putting everything in one giant request will negatively impact the coordinator significantly. Having everything in one partition can improve throughput for some batch sizes by reducing latency, but batches are never meant to be used to improve performance. Trying to optimize for maximum throughput with different batch sizes will depend largely on your use case, schema and nodes, and will require specific testing, since there's generally a cliff where the size starts to degrade performance.
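
As an illustration of that recommendation (not taken from the answer), here is a minimal sketch that splits the inserts into capped unlogged batches with the DataStax Java driver 3.x; the keyspace/table, columns, and the cap of 100 statements per batch are assumptions you'd tune with your own testing:

import java.util.Arrays;
import java.util.List;
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BatchedInserts {

    // Arbitrary cap for the sketch; tune it against the warn/fail thresholds
    // in cassandra.yaml and your own throughput measurements.
    private static final int MAX_STATEMENTS_PER_BATCH = 100;

    static void insertAll(Session session, String rowKey, List<String> values) {
        PreparedStatement insert = session.prepare(
                "INSERT INTO mykeyspace.mytable (row_key, value) VALUES (?, ?)");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (String value : values) {
            batch.add(insert.bind(rowKey, value));
            if (batch.size() >= MAX_STATEMENTS_PER_BATCH) {
                session.execute(batch); // flush the current chunk
                batch.clear();          // start a fresh batch for the next chunk
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);     // flush the remainder
        }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            insertAll(session, "some-row-key", Arrays.asList("a", "b", "c"));
        }
    }
}

Because all statements share the same partition key, each of these smaller batches still goes to a single node, just without tripping the size threshold on the coordinator.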

There is a

# Fail any batch exceeding this value. 50kb (10x warn threshold) by default.
batch_size_fail_threshold_in_kb: 50

option in your cassandra.yaml to increase it, but be sure to test to make sure you're actually helping and not hurting your throughput.

answered by Chris Lohfink

Looking at the Cassandra logs, you'll be able to spot things like:

ERROR 19:54:13 Batch for [matches] is of size 103.072KiB, exceeding specified threshold of 50.000KiB by 53.072KiB. (see batch_size_fail_threshold_in_kb)

answered by fivetwentysix


I fixed this issue by changing the CHUNKSIZE to a lower value (for example, 1): https://docs.datastax.com/en/cql/3.1/cql/cql_reference/copy_r.html

COPY mytable FROM 'mybackup' WITH CHUNKSIZE = 1;

The operation is much slower, but at least it works now.

answered by Etienne Cha