
Calculating the size of a table in Cassandra

Tags:

cassandra

In "Cassandra: The Definitive Guide" (2nd edition) by Jeff Carpenter & Eben Hewitt, the following formula is used to calculate the size of a table on disk (apologies for the blurred part):

[Image: table size equation]

  • ck: primary key columns
  • cs: static columns
  • cr: regular columns
  • cc: clustering columns
  • Nr: number of rows
  • Nv: it's used for counting the total size of the timestamps (I don't get this part completely, but for now I'll ignore it).
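
For readers without the image, the equation can be reconstructed from the symbol list above and from the arithmetic in the second answer below. This is a sketch, not a copy of the book's figure: the index letters and the symbol t_avg (an average metadata size per cell, estimated at 8 bytes in the first answer) are assumptions.

```latex
S_t = \sum_i \mathrm{sizeOf}(c_{k_i})
    + \sum_j \mathrm{sizeOf}(c_{s_j})
    + N_r \times \sum_k \Bigl( \mathrm{sizeOf}(c_{r_k}) + \sum_l \mathrm{sizeOf}(c_{c_l}) \Bigr)
    + N_v \times \mathrm{sizeOf}(t_{avg})
```

The inner sum over the clustering columns sits inside the sum over the regular columns, which is what the first question below is about.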

There are two things I don't understand in this equation.

First: why does the size of the clustering columns get counted once for every regular column? Shouldn't we multiply it by the number of rows instead? It seems to me that by calculating it this way, we're saying that the data in each clustering column gets replicated for each regular column, which I suppose is not the case.

Second: why don't the primary key columns get multiplied by the number of partitions? From my understanding, if we have a node with two partitions, then we should multiply the size of the primary key columns by two, because we'll have two different primary keys on that node.

asked Mar 11 '17 at 13:03 by Pedro Gordo




2 Answers

As the author, I greatly appreciate the question and your engagement with the material!

With respect to the original questions - remember that this is not the formula to calculate the size of the table, it is the formula to calculate the size of a single partition. The intent is to use this formula with "worst case" number of rows to identify overly large partitions. You'd need to multiply the result of this equation by the number of partitions to get an estimate of total data size for the table. And of course this does not take replication into account.
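
As a rough illustration of that scaling step, here is a minimal Python sketch. The function name, its parameters, and the sample numbers are hypothetical, and multiplying by the replication factor is an added assumption for a cluster-wide figure, since the formula itself ignores replication.

```python
def estimate_table_size(partition_size_bytes, num_partitions, replication_factor=1):
    """Scale a per-partition estimate up to a table-level estimate.

    With replication_factor=1 this is the logical data size; a larger
    value approximates the total across all replicas (an assumption,
    not part of the book's formula).
    """
    return partition_size_bytes * num_partitions * replication_factor


# Hypothetical numbers: a 1 MB worst-case partition and 500 partitions.
print(estimate_table_size(1_000_000, 500))                        # 500000000
print(estimate_table_size(1_000_000, 500, replication_factor=3))  # 1500000000
```

Because the worst-case partition size is used for every partition, the result is an upper-end estimate rather than an exact value.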

Also thanks to those who responded to the original question. Based on your feedback I spent some time looking at the new (3.0) storage format to see whether that might impact the formula. I agree that Aaron Morton's article is a helpful resource (link provided above).

The basic approach of the formula remains sound for the 3.0 storage format. The way the formula works, you're basically adding:

  • the sizes of the partition key and static columns
  • the size of the clustering columns per row, times the number of rows
  • 8 bytes of metadata for each cell

Updating the formula for the 3.0 storage format requires revisiting the constants. For example, the original equation assumes 8 bytes of metadata per cell to store a timestamp. The new format treats the timestamp on a cell as optional since it can be applied at the row level. For this reason, there is now a variable amount of metadata per cell, which could be as low as 1-2 bytes, depending on the data type.
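
To get a feel for how much the per-cell constant matters, the metadata term can be sketched in Python as below. The function name and sample numbers are hypothetical, and the cell count (one cell per static column plus one per regular column in every row) is an assumption about how Nv is derived.

```python
def metadata_overhead(num_rows, num_regular_columns, num_static_columns, bytes_per_cell):
    """Estimate the metadata bytes in one partition.

    Assumes one cell per static column per partition and one cell per
    regular column per row.
    """
    num_cells = num_static_columns + num_rows * num_regular_columns
    return num_cells * bytes_per_cell


# Hypothetical partition: 1000 rows, 2 regular columns, 1 static column.
print(metadata_overhead(1000, 2, 1, bytes_per_cell=8))  # 16008, original 8-byte assumption
print(metadata_overhead(1000, 2, 1, bytes_per_cell=2))  # 4002, low end for the 3.0 format
```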

After reading this feedback and rereading that section of the chapter, I plan to update the text to add some clarifications as well as stronger caveats about this formula being useful as an approximation rather than an exact value. There are factors it doesn't account for at all such as writes being spread over multiple SSTables, as well as tombstones. We're actually planning another printing this spring (2017) to correct a few errata, so look for those changes soon.

answered Sep 24 '22 at 21:09 by Jeff Carpenter


It's because of the internal storage structure used by Cassandra versions before 3.0.

  • There is only one entry for each distinct partition key value.
  • For each distinct partition key value there is only one entry for each static column.
  • For each row there is an empty entry for the clustering key.
  • For each regular column in a row there is a single entry, and its name is prefixed with the values of all the clustering key columns.

Let's take an example:

CREATE TABLE my_table (
    pk1 int,
    pk2 int,
    ck1 int,
    ck2 int,
    d1 int,
    d2 int,
    s int static,
    PRIMARY KEY ((pk1, pk2), ck1, ck2)
); 

Insert some dummy data:

 pk1 | pk2 | ck1 | ck2  | s     | d1     | d2
-----+-----+-----+------+-------+--------+---------
   1 |  10 | 100 | 1000 | 10000 | 100000 | 1000000
   1 |  10 | 100 | 1001 | 10000 | 100001 | 1000001
   2 |  20 | 200 | 2000 | 20000 | 200000 | 2000000

The internal structure will be:

             |100:1000:  |100:1000:d1|100:1000:d2|100:1001:  |100:1001:d1|100:1001:d2|  
-----+-------+-----------+-----------+-----------+-----------+-----------+-----------+
1:10 | 10000 |           |  100000   |  1000000  |           |  100001   |  1000001  |


             |200:2000:  |200:2000:d1|200:2000:d2|
-----+-------+-----------+-----------+-----------+ 
2:20 | 20000 |           |  200000   |  2000000  |
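
The mapping from CQL rows to this layout can be sketched in Python as follows. This only illustrates the naming scheme in the diagram and is not Cassandra's actual storage code; the function name is hypothetical, and labelling the static cell with its column name is an assumption, since the diagram leaves that header blank.

```python
def to_wide_rows(rows, partition_key, clustering_key, static_cols, regular_cols):
    """Group CQL-style rows into one wide row of named cells per partition."""
    partitions = {}
    for row in rows:
        pk = ":".join(str(row[c]) for c in partition_key)
        cells = partitions.setdefault(pk, {})
        # Static columns are stored once per partition.
        for col in static_cols:
            cells[col] = row[col]
        prefix = ":".join(str(row[c]) for c in clustering_key)
        # Empty marker cell for the row itself.
        cells[prefix + ":"] = ""
        # One cell per regular column, named with the clustering values.
        for col in regular_cols:
            cells[f"{prefix}:{col}"] = row[col]
    return partitions


rows = [
    {"pk1": 1, "pk2": 10, "ck1": 100, "ck2": 1000, "s": 10000, "d1": 100000, "d2": 1000000},
    {"pk1": 1, "pk2": 10, "ck1": 100, "ck2": 1001, "s": 10000, "d1": 100001, "d2": 1000001},
    {"pk1": 2, "pk2": 20, "ck1": 200, "ck2": 2000, "s": 20000, "d1": 200000, "d2": 2000000},
]
layout = to_wide_rows(rows, ["pk1", "pk2"], ["ck1", "ck2"], ["s"], ["d1", "d2"])
for pk, cells in layout.items():
    print(pk, list(cells))
```

Every regular cell name repeats the clustering values, which is why the formula counts the clustering columns once per regular column.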

So the size of the table, taking the two-row partition as the worst case, will be:

Single Partition Size = (4 + 4 + 4 + 4) + 4 + 2 * ((4 + (4 + 4)) + (4 + (4 + 4))) bytes = 68 bytes

Estimated Table Size = Single Partition Size * Number Of Partitions
                     = 68 * 2 bytes
                     = 136 bytes

  • Here every field is of type int (4 bytes).
  • There are 4 primary key columns, 1 static column, 2 clustering key columns and 2 regular columns.
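
The same arithmetic can be checked with a short Python sketch. The variable names are hypothetical. Like the calculation above, it leaves out the timestamp term and counts two rows for every partition.

```python
INT_SIZE = 4  # every column in my_table is an int

primary_key_columns = ["pk1", "pk2", "ck1", "ck2"]
static_columns = ["s"]
clustering_columns = ["ck1", "ck2"]
regular_columns = ["d1", "d2"]
rows_per_partition = 2  # worst case: partition 1:10 has two rows
num_partitions = 2

primary_key_bytes = INT_SIZE * len(primary_key_columns)
static_bytes = INT_SIZE * len(static_columns)
# Each regular column carries its own value plus all clustering values.
per_row_bytes = sum(
    INT_SIZE + INT_SIZE * len(clustering_columns) for _ in regular_columns
)

single_partition_size = (
    primary_key_bytes + static_bytes + rows_per_partition * per_row_bytes
)
estimated_table_size = single_partition_size * num_partitions

print(single_partition_size, estimated_table_size)  # 68 136
```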

More: http://opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure/

answered Sep 25 '22 at 21:09 by Ashraful Islam