We have this typical scenario:
1 column family with less than 10 simple columns.
When we get a request from a client, we need to write 10,000,000 records of this column family to the database, and we write them in batches (1,000 records per batch). This usually takes 5-10 minutes, depending on the number of nodes in the cluster and the replication factor.
In the next few hours after the initial writes we receive lots of updates (each record is updated twice).
So we have lots of writes/updates in one period of the day (about an hour) and very few after that.
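For reference, the batching described above can be sketched as a simple chunking helper; `write_batch` here is a hypothetical stand-in for whatever driver call actually submits a batch:

```python
from itertools import islice

def chunked(records, batch_size=1000):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def write_batch(batch):
    """Placeholder for the real driver call that submits one batch."""
    pass

# Split the incoming records into 1,000-record batches and submit each one.
for batch in chunked(range(10_000), 1000):
    write_batch(batch)
```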
The question is: what steps can we take to improve write/update performance? I have noticed configuration fields such as memtable_flush_queue_size, but I don't have enough experience with Cassandra to know exactly what to do.
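For illustration, these are the kinds of cassandra.yaml knobs in this area that influence write throughput; the values below are assumptions to benchmark against your own hardware, not recommendations:

```yaml
# cassandra.yaml (illustrative values, not recommendations)
memtable_flush_queue_size: 4        # memtables allowed to wait for a flush writer
memtable_flush_writers: 2           # flush threads; roughly one per data directory
concurrent_writes: 32               # common rule of thumb: ~8 x number of cores
commitlog_sync: periodic            # periodic sync favors write throughput
commitlog_sync_period_in_ms: 10000
```

Any change here should be validated with a load test that mirrors the bursty one-hour write window, since the defaults are already reasonable for many workloads.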
Any suggestion is helpful,
Ivan
First, writing data is very fast in Cassandra, because its design does not require performing disk reads or seeks. The memtables and SSTables save Cassandra from having to perform these operations on writes, which slow down many databases. All writes in Cassandra are append-only.
Cassandra is excellent for write operations and somewhat slower on reads. Both are fast, but writes are faster. Cassandra also has the benefits of high availability (no single point of failure) and tunable consistency. It is very fast at writing bulk data in sequence and reading it back sequentially.
Even with a fairly small load for Cassandra, like 10,000 writes/s, you can end up with 100,000 business operations stored, once again, only in memory, just like in Redis :) Before you start panicking and moving all your data to a good old RDBMS, hold on: Cassandra is a distributed DB.
This might help to get some better understanding:
http://maciej-miklas.blogspot.de/2012/09/cassanrda-tuning-for-frequent-column.html
http://maciej-miklas.blogspot.de/2012/08/cassandra-11-reading-and-writing-from.html
In addition to Maciej's good points, I would add at a higher level that using batches to bulk-load normal writes is an antipattern. Its main effect is to make your workload more "bursty," which is bad. Use batches only when you have writes that must be applied together for consistency.
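As a sketch of the alternative: instead of grouping unrelated rows into batches, issue individual writes concurrently and cap how many are in flight. The `write_row` function below is a hypothetical stand-in for an asynchronous single-row write (e.g. a driver's `execute_async`); only the throttling pattern is the point:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def write_row(row):
    """Placeholder for an async single-row write (e.g. execute_async)."""
    return row  # a real client would return once the write is acknowledged

def write_all(rows, max_in_flight=64):
    """Issue individual writes, keeping at most max_in_flight pending."""
    written = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        pending = set()
        for row in rows:
            pending.add(pool.submit(write_row, row))
            if len(pending) >= max_in_flight:
                # Wait for at least one write to finish before submitting more,
                # so the cluster sees a steady stream instead of bursts.
                done, pending = wait(pending, return_when=FIRST_COMPLETED)
                written += len(done)
        done, _ = wait(pending)
        written += len(done)
    return written
```

The in-flight cap is what smooths the load: the client never has more than `max_in_flight` outstanding requests, so the coordinator nodes are not hit with 1,000-row bursts.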
For bulk loading, consider batching the data at the source and using sstableloader, but I wouldn't recommend investing that effort until you reach the ~100M-row level.