We have this typical scenario:
1 column family with less than 10 simple columns.
When we get a request from a client, we need to write 10,000,000 records of this column family to the database, and we write them in batches (1,000 records per batch). This usually takes 5-10 minutes, depending on the number of nodes in the cluster and the replication factor.
In the next few hours after the initial writes we receive lots of updates (each record is updated twice).
So we have lots of writes/updates in one period of the day (about an hour) and very few after that.
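For reference, the batching described above can be sketched as a simple chunking helper; `write_batch` here is a hypothetical stand-in for whatever driver call actually submits a batch:

```python
from itertools import islice

def chunked(records, batch_size=1000):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def write_batch(batch):
    """Placeholder for the real driver call that submits one batch."""
    pass

# Split the incoming records into 1,000-record batches and submit each one.
for batch in chunked(range(10_000), 1000):
    write_batch(batch)
```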
The question is: what steps can we take to improve write/update performance? I have noticed configuration fields such as memtable_flush_queue_size, but I don't have enough experience with Cassandra to know exactly what to do.
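For illustration, these are the kinds of cassandra.yaml knobs in this area that influence write throughput; the values below are assumptions to benchmark against your own hardware, not recommendations:

```yaml
# cassandra.yaml (illustrative values, not recommendations)
memtable_flush_queue_size: 4        # memtables allowed to wait for a flush writer
memtable_flush_writers: 2           # flush threads; roughly one per data directory
concurrent_writes: 32               # common rule of thumb: ~8 x number of cores
commitlog_sync: periodic            # periodic sync favors write throughput
commitlog_sync_period_in_ms: 10000
```

Any change here should be validated with a load test that mirrors the bursty one-hour write window, since the defaults are already reasonable for many workloads.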
Any suggestion is helpful,
Ivan
First, writing data is very fast in Cassandra, because its design does not require performing disk reads or seeks. The memtables and SSTables save Cassandra from having to perform these operations on writes, which slow down many databases. All writes in Cassandra are append-only.
Cassandra is excellent for write operations and somewhat slower on reads. Both are fast, but writes are faster. Cassandra also has the benefits of high availability (no single point of failure) and tunable consistency. It is very fast at writing bulk data in sequence and reading it back sequentially.
Even with a fairly small load for Cassandra, like 10,000 writes/s, you can end up with 100,000 business operations stored, once again, only in memory, just like in Redis :) Before you start panicking and moving all your data to a good old RDBMS, hold on: Cassandra is a distributed DB.
This might help to get some better understanding:
http://maciej-miklas.blogspot.de/2012/09/cassanrda-tuning-for-frequent-column.html
http://maciej-miklas.blogspot.de/2012/08/cassandra-11-reading-and-writing-from.html
In addition to Maciej's good points, I would add at a higher level that using batches to bulk-load normal writes is an antipattern. Its main effect is to make your workload more "bursty," which is bad. Use batches only when you have writes that must be applied together for consistency.
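As a sketch of the alternative: instead of grouping unrelated rows into batches, issue individual writes concurrently and cap how many are in flight. The `write_row` function below is a hypothetical stand-in for an asynchronous single-row write (e.g. a driver's `execute_async`); only the throttling pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def write_row(row):
    """Placeholder for an async single-row write (e.g. execute_async)."""
    return row  # a real client would return once the write is acknowledged

def write_all(rows, max_in_flight=64):
    """Issue individual writes, keeping at most max_in_flight pending."""
    written = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        pending = set()
        for row in rows:
            pending.add(pool.submit(write_row, row))
            if len(pending) >= max_in_flight:
                # Wait for at least one write to finish before submitting more,
                # so the cluster sees a steady stream instead of bursts.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                written += len(done)
        done, _ = wait(pending)
        written += len(done)
    return written
```

The in-flight cap is what smooths the load: the client never has more than `max_in_flight` outstanding requests, so the coordinator nodes are not hit with 1,000-row bursts.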
For bulk loading, consider batching the data at the source and using sstableloader, but I wouldn't recommend investing that effort until you reach the ~100M-row level.