I am trying to calculate the partition size for each row in a table with an arbitrary number of columns and types, using a formula from the DataStax Academy Data Modeling Course.
In order to do that I need to know the size in bytes of some common Cassandra data types. I tried to Google this, but I got many conflicting suggestions, so I am puzzled.
The data types I would like to know the byte size of are:
Any other considerations regarding data type sizes in Cassandra would of course also be appreciated.
Adding more info, since it seems to be causing confusion: I am only trying to estimate the worst-case disk usage the data would occupy, without any compression or other optimizations that Cassandra does behind the scenes.
I am following the DataStax Academy course DS220 (see link at the end), implementing the formula from it, and will use the info from answers here as variables in that formula.
https://academy.datastax.com/courses/ds220-data-modeling/physical-partition-size
Partition size is measured by the number of cells (values) that are stored in the partition. Cassandra's hard limit is 2 billion cells per partition, but you'll likely run into performance issues before reaching that limit.
To calculate the size of a row, we need to sum the size of all columns within the row and add that sum to the partition key size. Assuming the size of the partition key is consistent throughout a table, calculating the size of a table is almost identical to calculating the size of a partition.
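As a minimal sketch of that per-row summation (my own illustration, not from the course; the column names and byte counts are made-up assumptions):

```python
# Rough per-row estimate: sum fixed column sizes and add the partition key size.
partition_key_bytes = 16            # e.g. a uuid partition key
column_bytes = {
    "created_at": 8,                # timestamp
    "score": 8,                     # double
    "active": 1,                    # boolean
    "name": 20,                     # text, assuming ~20 bytes on average
}

row_size_bytes = partition_key_bytes + sum(column_bytes.values())
print(row_size_bytes)  # 53 for this made-up row, before any storage overhead
```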
As we learned earlier, Cassandra uses a consistent hashing technique to generate the hash value of the partition key (app_name) and assign the row data to a partition range inside a node.
For example, if you had a 6-node DC with a replication factor of 3 and each node held about 100GB of data, the total size would be: table_size = (100GB + 100GB + 100GB + 100GB + 100GB + 100GB) / 3 = 600GB / 3 = 200GB.
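The same arithmetic as a quick sketch, assuming every node really holds the same amount of data:

```python
# Raw data across the cluster divided by the replication factor
# gives the logical table size (numbers from the example above).
node_count = 6
data_per_node_gb = 100
replication_factor = 3

table_size_gb = node_count * data_per_node_gb / replication_factor
print(table_size_gb)  # 200.0
```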
I think, from a pragmatic point of view, that it is wise to get a back-of-the-envelope estimate of the worst case up-front at design time, using the formulae in the DS220 course. The effect of compression often varies depending on the algorithm and on patterns in the data. From DS220 and http://cassandra.apache.org/doc/latest/cql/types.html (a rough sketch of the worst-case calculation follows the list below):
uuid: 16 bytes
timeuuid: 16 bytes
timestamp: 8 bytes
bigint: 8 bytes
counter: 8 bytes
double: 8 bytes
time: 8 bytes
inet: 4 bytes (IPv4) or 16 bytes (IPv6)
date: 4 bytes
float: 4 bytes
int: 4 bytes
smallint: 2 bytes
tinyint: 1 byte
boolean: 1 byte (hopefully; I have no authoritative source for this)
ascii: requires an estimate of average # chars * 1 byte/char
text/varchar: requires an estimate of average # chars * (avg. # bytes/char for language)
map/list/set/blob: requires an estimate based on the expected contents
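Here is the promised sketch: a small Python version of the uncompressed, worst-case partition size estimate in the spirit of the DS220 formula. The ~8-byte per-cell overhead and all example column sizes are my assumptions for illustration, not authoritative numbers.

```python
# Back-of-the-envelope, uncompressed partition-size estimate (DS220-style).
def partition_size_bytes(n_rows,
                         pk_sizes,        # byte sizes of partition key columns
                         static_sizes,    # byte sizes of static columns
                         regular_sizes,   # byte sizes of regular (per-row) columns
                         cell_overhead=8):  # assumed per-cell metadata overhead
    # Number of data cells: one per regular column per row, plus the statics.
    n_values = n_rows * len(regular_sizes) + len(static_sizes)
    return (sum(pk_sizes)
            + sum(static_sizes)
            + n_rows * sum(regular_sizes)
            + n_values * cell_overhead)

# Example: uuid partition key, one ~50-byte static text column,
# three regular columns (timestamp, double, ~20-byte text), 10,000 rows.
print(partition_size_bytes(10_000, [16], [50], [8, 8, 20]))  # roughly 600 KB
```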
Hope it helps.
The only reliable way to estimate the overhead associated with something is to actually measure it. Really, you can't take the individual data types and generalize from them. If you have 4 bigint columns and you suppose your overhead is X, with 400 bigint columns your overhead probably won't be 100x. That's because Cassandra compresses everything before storing data on disk (by default; it's a setting tunable per column family).
Try loading some data, I mean real production data, into the cluster, and then check your results against your compression configuration. You'll find some surprises.
Know your data.