 

Tuning write performance in Cassandra

We have this typical scenario:

One column family with fewer than 10 simple columns.

When we get a request from a client, we need to write 10,000,000 records of this column family to the database, and we write them in batches (1,000 per batch). This usually takes 5-10 minutes, depending on the number of nodes in the cluster and the replication factor.
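For concreteness, here is a minimal sketch of the batched load described above, using the DataStax Java driver (an assumption; the post does not name a client library, and the table and column names here are made up). An unlogged batch is shown, since atomicity across 1,000 unrelated rows is presumably not needed:

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class BulkLoad {
    private static final int BATCH_SIZE = 1000;

    public static void main(String[] args) {
        // Contact point and keyspace are placeholders.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");

        // Prepare once, bind per record.
        PreparedStatement insert = session.prepare(
                "INSERT INTO records (id, col1, col2) VALUES (?, ?, ?)");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (long id = 0; id < 10_000_000L; id++) {
            batch.add(insert.bind(id, "v1", "v2"));
            if (batch.size() == BATCH_SIZE) {
                session.execute(batch);   // one round trip per 1,000 rows
                batch.clear();
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);       // flush the final partial batch
        }
        cluster.close();
    }
}
```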

Within a few hours after the writes start, we will receive lots of updates (each record is updated twice).

So we have lots of writes/updates concentrated in one part of the day (about an hour), and very little activity after that.

The question is: what steps can we take to improve write/update performance? I have noticed configuration fields such as memtable_flush_queue_size, but I don't have enough experience with Cassandra to know exactly what to do.
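For orientation, the memtable-related knobs live in cassandra.yaml. A hedged excerpt (option names are from the Cassandra 1.x/2.0 era; the values shown are roughly the defaults, not tuned recommendations for this workload):

```yaml
# cassandra.yaml -- memtable/write-path settings (illustrative values)

# Threads that flush memtables to disk in parallel.
memtable_flush_writers: 1

# How many full memtables may queue for flushing before writes block;
# raising this can help absorb write bursts if flushing is the bottleneck.
memtable_flush_queue_size: 4

# Total heap space memtables may occupy before the largest one is flushed.
# Larger values keep more updates in RAM (defaults to 1/3 of the heap).
memtable_total_space_in_mb: 2048

# Concurrent write requests; the usual rule of thumb is 8 x CPU cores.
concurrent_writes: 32
```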

Any suggestion is helpful,

Ivan

asked Feb 17 '14 by Ivan Longin


2 Answers

  1. Increase JVM memory (max 12 GB on Java 6+). This will automatically increase the size of the memtables and reduce the flush frequency. It also means that frequent updates will be merged together in RAM rather than during compaction, which reduces disk usage as well. As always there is a disadvantage: Cassandra will need more time to start, because the commit log will be larger (it is removed only once the corresponding memtable has been flushed into an SSTable).
  2. VERY IMPORTANT: use separate disks for the data directories and for the commit log (see the configuration excerpt after this list). You could use an SSD for data; it makes little sense for the commit log, because commit-log writes are sequential anyway.
  3. Changing the replication factor to 1 would generate less load in the cluster, because each node would only take care of its own data and no additional replicas. But you might lose data, so I would not recommend it.
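To make points 1 and 2 concrete: the heap is raised via MAX_HEAP_SIZE in conf/cassandra-env.sh, and the data/commit-log split is configured in cassandra.yaml, roughly like this (the paths are examples):

```yaml
# cassandra.yaml -- put the commit log on its own device, data on SSD
data_file_directories:
    - /mnt/ssd1/cassandra/data
commitlog_directory: /mnt/spinning1/cassandra/commitlog
```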

These might help you get a better understanding:

http://maciej-miklas.blogspot.de/2012/09/cassanrda-tuning-for-frequent-column.html

http://maciej-miklas.blogspot.de/2012/08/cassandra-11-reading-and-writing-from.html

answered by Maciej Miklas


In addition to Maciej's good points, I would add at a higher level that using batches to bulk load normal writes is an antipattern. Its main effect is to make your workload more "bursty", which is bad. Use batches only when you have writes that need to be applied together for consistency.

For bulk loads, consider batching at the source and using sstableloader, but I wouldn't recommend investing that effort until you are at the ~100M row level.
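As an illustration of the non-batch alternative, here is a sketch using individual asynchronous writes with a cap on in-flight requests (again assuming the DataStax Java driver and made-up table names; the concurrency limit of 256 is arbitrary):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import com.google.common.util.concurrent.MoreExecutors;
import java.util.concurrent.Semaphore;

public class AsyncLoad {
    private static final int MAX_IN_FLIGHT = 256; // arbitrary throttle

    public static void main(String[] args) throws InterruptedException {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_keyspace");
        PreparedStatement insert = session.prepare(
                "INSERT INTO records (id, col1, col2) VALUES (?, ?, ?)");

        // Throttle so the async firehose cannot overwhelm the cluster.
        final Semaphore inFlight = new Semaphore(MAX_IN_FLIGHT);

        for (long id = 0; id < 10_000_000L; id++) {
            inFlight.acquire();
            ResultSetFuture future = session.executeAsync(insert.bind(id, "v1", "v2"));
            // Release the permit when the write completes (success or failure).
            future.addListener(new Runnable() {
                public void run() {
                    inFlight.release();
                }
            }, MoreExecutors.sameThreadExecutor());
        }
        inFlight.acquire(MAX_IN_FLIGHT); // wait for all outstanding writes
        cluster.close();
    }
}
```

Compared to 1,000-row batches, this keeps a steady stream of small requests flowing, spreading load across the cluster instead of concentrating each batch on a single coordinator.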

answered by jbellis