What is difference between these two kinds of tables in Cassandra?
First :
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY (sensor_id, collected_at,volts )
)
and Second:
CREATE TABLE data (
sensor_id int,
collected_at timestamp,
volts float,
volts2 float,
PRIMARY KEY ((sensor_id, collected_at),volts )
)
My questions:
The difference is the primary key. Cassandra primary key is divided in (Partition Key, Clustering Key).
Partition Key decides where a register goes within the ring and Clustering determines how the registers with same partition key are stored to make use of the on-disk sorting of columns in your queries.
First table:
Second table:
Imagine you have billions of registers for the same sensor_id. Using the first approach you will store it in the same node so probably you will run out of space. If you use the second approach you will have to query using an exact sensor_id and collected_at timestamp so probably it doesn't make sense. Because of that in Cassandra modeling you must know what queries are you going to execute before create the model.
The first table partitions data on sensor_id
only. Meaning, that all data underneath each sensor_id
is stored in the same data partition. The hashed token value of sensor_id
also determines which node(s) in the cluster the data partition is stored on. Data within each partition is sorted by collected_at
and volts
.
The second table uses a composite key on both sensor_id
and collected_at
to determine data partitioning. Data in each partition is sorted by volts
.
When we use first table and when we use the second table ?
As you have to pass all of your partition keys in a query, the first table offers more query flexibility. That is, you can decide to query only on sensor_id
, and then you can choose whether or not to also query by collected_at
and then volts
. In the second table, you have to query by both sensor_id
and collected_at
. So you have less query flexibility, but you get better data distribution out of the second model.
And actually, partitioning on a timestamp (second table) value is typically not very useful, because you would have to have that exact timestamp before executing your query. Typically what you see when timestamp components are used in a partition key, is in a technique called "date bucketing," in which you would use something with less precision like month or day. That way, you could still query for an entire month/day or whatever your bucket was.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With