
Cassandra - What is the reasonable maximum number of tables?

I am new to Cassandra. As I understand it, the maximum number of tables that can be stored per keyspace is Integer.MAX_VALUE. However, what are the performance implications (speed, storage, etc.) of such a large number of tables? Is there any recommendation regarding that?

Altober asked Oct 19 '15 10:10



2 Answers

While there are legitimate use cases for having lots of tables in Cassandra, they are rare. Your use case might be one of them, but make sure that it is. Without knowing more about the problem you're trying to solve, it's hard to give guidance. Many tables will require more resources, obviously. How much? That depends on the settings and the usage.

For example, if you have a thousand tables and write to all of them at the same time there will be contention for RAM since there will be memtables for each of them, and there is a certain overhead for each memtable (how much depends on which version of Cassandra, your settings, etc.).

However, if you have a thousand tables but don't write to all of them at the same time, there will be less contention. There's still a per-table overhead, but there will be more RAM to keep the active tables' memtables around.
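To make the memtable argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not how Cassandra actually allocates memory; it only illustrates the reasoning above. The function name, the 2048 MB memtable budget, and the 1 MB fixed overhead per table are all assumptions picked for illustration, so measure your own cluster before relying on any numbers.

```python
def memtable_mb_per_active_table(total_memtable_mb, table_count,
                                 active_tables, per_table_overhead_mb=1.0):
    """Very rough estimate of memtable space left per actively written table.

    Assumes (hypothetically) that every table costs a fixed overhead whether
    or not it is written to, and that the remaining space is shared evenly
    among the tables currently receiving writes.
    """
    if not 0 < active_tables <= table_count:
        raise ValueError("active_tables must be between 1 and table_count")
    fixed_overhead = table_count * per_table_overhead_mb
    remaining = max(total_memtable_mb - fixed_overhead, 0.0)
    return remaining / active_tables


# 1000 tables, all written to at once: little room per memtable.
all_active = memtable_mb_per_active_table(2048, 1000, 1000)

# Same 1000 tables, but only 50 written to at a time: far more room each.
few_active = memtable_mb_per_active_table(2048, 1000, 50)

print(f"all active: {all_active:.2f} MB, few active: {few_active:.2f} MB")
```

With these made-up numbers the per-table overhead is the same in both cases, but the space available to each active memtable differs by a factor of 20, which is the contention effect described above.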

The same goes for disk IO. If you read and write to a lot of different tables at the same time the disk is going to do much more random IO.

Just having lots of tables isn't a big problem in itself. There is a practical limit to how many you can have: you can have as many as you want, provided you have enough RAM for the structures that keep track of them. Having lots of tables and reading and writing to them all at the same time will be a problem, though. It will require more resources than doing the same number of reads and writes to fewer tables.

Theo answered Oct 19 '22 12:10



In my opinion, splitting the data into multiple tables, even thousands, is beneficial if you can do it.

Pros:

  1. Suppose you want to scale in the future to 10+ nodes with an RF (replication factor) of 2. This will result in the data being evenly distributed across the nodes, thus not scalable.
  2. Another point is random IO, which will be heavy if you read from many tables at the same time, but I don't see why that would differ from having just one table. Either way you will seek to another partition key, so there is no difference in IO.
  3. When compaction takes place, it will have to do less work if there is only one table. The values from the SSTables must be loaded into memory, merged, and saved back.

Cons:

  1. Having multiple tables will result in having multiple memtables. I think the difference added by this to the RAM is insignificant.

Also, check out these links; they helped me A LOT:
http://manuel.kiessling.net/2016/07/11/how-cassandras-inner-workings-relate-to-performance/
https://www.infoq.com/presentations/Apache-Cassandra-Anti-Patterns

Please feel free to edit my post; I am kinda new to Big Data.

Iacobescu Radu answered Oct 19 '22 12:10
