
What is the byte size of common Cassandra data types - To be used when calculating partition disk usage?

I am trying to calculate the partition size for each row in a table with an arbitrary number of columns and types, using a formula from the Datastax Academy Data Modeling Course.

In order to do that I need to know the "size in bytes" for some common Cassandra data types. I tried googling this, but I found many conflicting suggestions, so I am puzzled.

The data types I would like to know the byte size of are:

  • A single Cassandra TEXT character (I found answers ranging from 2 to 4 bytes)
  • A Cassandra DECIMAL
  • A Cassandra INT (I suppose it is 4 bytes)
  • A Cassandra BIGINT (I suppose it is 8 bytes)
  • A Cassandra BOOLEAN (I suppose it is 1 byte, or is it a single bit?)

Any other considerations would of course also be appreciated regarding data types sizes in Cassandra.

Adding more info, since this seems to have caused confusion: I am only trying to estimate the "worst scenario disk usage" the data would occupy without any compression or other optimizations done by Cassandra behind the scenes.

I am following the Datastax Academy course DS220 (see the link at the end) and implementing its formula. I will use the info from the answers here as variables in that formula.

https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size

nicgul asked Oct 17 '16 13:10



2 Answers

I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of worst case using the formulae in the ds220 course up-front at design time. The effect of compression often varies depending on algorithms and patterns in the data. From ds220 and http://cassandra.apache.org/doc/latest/cql/types.html:

uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPv6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully.. no source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: an estimate
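
As a rough illustration of how the sizes above could feed a per-row estimate, here is a minimal Python sketch. The names `FIXED_SIZES`, `text_size`, and `row_value_bytes` are hypothetical helpers made up for this example, not part of Cassandra or the DS220 material. It assumes text values are stored as UTF-8 (1 to 4 bytes per character) and it only sums raw value bytes; the additional terms of the DS220 formula and any storage overhead are not modeled. Variable-size types such as inet, decimal, and collections are left out and would need their own estimates.

```python
# Fixed-width sizes in bytes, copied from the list above.
FIXED_SIZES = {
    "uuid": 16, "timeuuid": 16, "timestamp": 8, "bigint": 8, "counter": 8,
    "double": 8, "time": 8, "date": 4, "float": 4, "int": 4,
    "smallint": 2, "tinyint": 1, "boolean": 1,
}

# Types whose size depends on the stored string.
STRING_TYPES = ("text", "varchar", "ascii")


def text_size(sample: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of a sample string."""
    return len(sample.encode("utf-8"))


def row_value_bytes(columns: dict, text_samples: dict = None) -> int:
    """Sum the raw value sizes for one row.

    columns maps column name -> CQL type name.
    text_samples maps column name -> a representative string for that column.
    """
    text_samples = text_samples or {}
    total = 0
    for name, cql_type in columns.items():
        if cql_type in STRING_TYPES:
            total += text_size(text_samples[name])
        else:
            total += FIXED_SIZES[cql_type]
    return total
```

For example, `row_value_bytes({"id": "uuid", "n": "int", "name": "text"}, {"name": "héllo"})` adds 16 + 4 + 6 bytes, because "é" takes two bytes in UTF-8 while the other four characters take one each.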

Hope it helps.

James Fremen answered Sep 22 '22 19:09



The only reliable way to estimate the overhead associated with something is to actually measure it. You can't take the individual data types and generalize from them. If you have 4 bigint columns and suppose that your overhead is X, then with 400 bigint columns your overhead probably won't be 100X. That's because Cassandra compresses everything before storing data on disk (by default; this is a setting tunable per column family).

Try loading some data, ideally production data, into the cluster, and then let us know your results and compression configuration. You may find some surprises.
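
To illustrate why compressed size cannot be derived from type sizes alone, here is a small Python sketch. It is only a stand-in: it uses Python's `zlib` rather than whichever compressor the table is actually configured with, and the helper name `compressed_size` is made up for this example. It packs 400 bigint-sized values twice, once with identical values and once with random ones, and compares the results.

```python
import os
import struct
import zlib


def compressed_size(values):
    """Pack values as 8-byte big-endian integers; return (raw, compressed) byte counts."""
    raw = b"".join(struct.pack(">q", v) for v in values)
    return len(raw), len(zlib.compress(raw))


# 400 identical bigints: a highly repetitive pattern.
raw_rep, comp_rep = compressed_size([42] * 400)

# 400 random bigints: almost no pattern for the compressor to exploit.
random_vals = [struct.unpack(">q", os.urandom(8))[0] for _ in range(400)]
raw_rand, comp_rand = compressed_size(random_vals)

print(f"repetitive: {raw_rep} raw bytes -> {comp_rep} compressed")
print(f"random:     {raw_rand} raw bytes -> {comp_rand} compressed")
```

Both inputs are 3200 raw bytes, yet the repetitive one shrinks to a small fraction of that while the random one barely shrinks at all. The same column types can therefore occupy very different amounts of disk depending on the data.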

Know your data.

xmas79 answered Sep 22 '22 19:09
