Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What do nested parenthesis indicate in a PRIMARY KEY definition

What is difference between these two kinds of tables in Cassandra?

First :

CREATE TABLE data (
    sensor_id int,
    collected_at timestamp,
    volts float,
    volts2 float,
    PRIMARY KEY (sensor_id, collected_at,volts )
)

and Second:

CREATE TABLE data (
    sensor_id int,
    collected_at timestamp,
    volts float,
    volts2 float,
    PRIMARY KEY ((sensor_id, collected_at),volts )
)

My questions:

  1. What is difference between these two tables?
  2. When would we use first table, and when would we use the second table?
like image 862
Senat Avatar asked Mar 29 '16 07:03

Senat


2 Answers

The difference is the primary key. Cassandra primary key is divided in (Partition Key, Clustering Key).

Partition Key decides where a register goes within the ring and Clustering determines how the registers with same partition key are stored to make use of the on-disk sorting of columns in your queries.

First table:

  • Sensor_id is your Partition Key so you know every register with the same sensor_id will go to the same node.
  • You have two clustering keys, collected_at and volts fields so data with the same sensor_id will be stored ordered by collected_at field in ascending order and data with same sendor_id, collected_at fields will be stored ordered by volts field in ascending order.

Second table:

  • You will have a compound Partition Key (sensor_id, collected_at) so you know every register with the same sensor_id and collected_at will go to the same node.
  • Your clustering key is volts so data with same (sensor_id, collected_at) will be stored ordered by volts in ascending order.

Imagine you have billions of registers for the same sensor_id. Using the first approach you will store it in the same node so probably you will run out of space. If you use the second approach you will have to query using an exact sensor_id and collected_at timestamp so probably it doesn't make sense. Because of that in Cassandra modeling you must know what queries are you going to execute before create the model.

like image 154
gasparms Avatar answered Sep 21 '22 18:09

gasparms


The first table partitions data on sensor_id only. Meaning, that all data underneath each sensor_id is stored in the same data partition. The hashed token value of sensor_id also determines which node(s) in the cluster the data partition is stored on. Data within each partition is sorted by collected_at and volts.

The second table uses a composite key on both sensor_id and collected_at to determine data partitioning. Data in each partition is sorted by volts.

When we use first table and when we use the second table ?

As you have to pass all of your partition keys in a query, the first table offers more query flexibility. That is, you can decide to query only on sensor_id, and then you can choose whether or not to also query by collected_at and then volts. In the second table, you have to query by both sensor_id and collected_at. So you have less query flexibility, but you get better data distribution out of the second model.

And actually, partitioning on a timestamp (second table) value is typically not very useful, because you would have to have that exact timestamp before executing your query. Typically what you see when timestamp components are used in a partition key, is in a technique called "date bucketing," in which you would use something with less precision like month or day. That way, you could still query for an entire month/day or whatever your bucket was.

like image 23
Aaron Avatar answered Sep 24 '22 18:09

Aaron