
Cassandra partition key for time series data

I'm testing Cassandra as a time series database.

I created my data model as below:

CREATE KEYSPACE sm WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

USE sm;

CREATE TABLE newdata (timestamp timestamp,
  deviceid int, tagid int,
  decvalue decimal,
  alphavalue text,
  PRIMARY KEY (deviceid,tagid,timestamp));

In the primary key, I set deviceid as the partition key, which means all data with the same deviceid will be written to one node. (Does that mean one machine or one partition? Each partition can have a max of 2 billion rows.) Also, if I query data within the same node, retrieval will be fast; am I correct? I'm new to Cassandra and a bit confused about the partition key and clustering key.

Most of my queries will be as below:

  • Select the latest timestamp for a known deviceid and tagid
  • Select decvalue for a known deviceid, tagid, and timestamp
  • Select alphavalue for a known deviceid, tagid, and timestamp
  • Select * for a known deviceid and tagid within a time range
  • Select * for a known deviceid within a time range

I will have around 2,000 deviceids, and each deviceid will have 60 tagid/value pairs. I'm not sure if this will produce a wide row of deviceid, timestamp, tagid/value, tagid/value...

Phuong Le asked Mar 16 '16 at 23:03

1 Answer

I’m new to Cassandra and a bit confused about the partition key and clustering key.

It sounds like you understand partition keys, so I'll just add that your partition key helps Cassandra figure out where (which token range) in the cluster to store your data. Each node is responsible for several primary token ranges (assuming vnodes). When your data is written to a data partition, it is sorted by your clustering keys. This is also how it is stored on-disk, so remember that your clustering keys determine the order in which your data is stored on disk.

Each partition can have max 2 billion rows

That's not exactly true. Each partition can support up to 2 billion cells. A cell is essentially a column name/value pair, and your clustering keys together count as one additional cell. So compute your cells by counting the column values stored in each CQL row, and add one more if you use clustering columns. In the newdata table above, for example, each row stores two values (decvalue and alphavalue) plus one cell for the clustering keys; at three cells per row, the theoretical ceiling is closer to 666 million rows per partition.

Depending on your wide-row structure, your practical limit will be far fewer than 2 billion rows. And that's just the storage limitation: even if you managed to store 1 million CQL rows in a single partition, querying that partition would return so much data that it would be unwieldy and would probably time out.

if I query data within the same node, the retrieval will be fast, am I correct?

It'll at least be faster than multi-key queries that hit multiple nodes. But whether or not it will be "fast" depends on other things, like how wide your rows are and how often you do things like deletes and in-place updates.

Most of my query will be as below:

Select the latest timestamp for a known deviceid and tagid
Select decvalue for a known deviceid, tagid, and timestamp
Select alphavalue for a known deviceid, tagid, and timestamp
Select * for a known deviceid and tagid within a time range
Select * for a known deviceid within a time range
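To make the first four concrete, here is roughly what they look like in CQL against your existing newdata table (the deviceid/tagid/timestamp literals are placeholders). Note that Cassandra only lets you ORDER BY the clustering columns in their declared order or its full reverse, so the latest-timestamp query reverses both clustering keys:

SELECT timestamp FROM newdata
WHERE deviceid = 1042 AND tagid = 7
ORDER BY tagid DESC, timestamp DESC
LIMIT 1;

SELECT decvalue FROM newdata
WHERE deviceid = 1042 AND tagid = 7
AND timestamp = '2016-03-16 10:00:00';

SELECT * FROM newdata
WHERE deviceid = 1042 AND tagid = 7
AND timestamp >= '2016-03-01 00:00:00'
AND timestamp < '2016-03-16 00:00:00';

Alternatively, declaring the table WITH CLUSTERING ORDER BY (tagid ASC, timestamp DESC) stores the newest row first, so the latest-timestamp query only needs LIMIT 1.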

Your current data model can support all of those queries except the last one. To perform a range query on timestamp without also specifying tagid, you'll need to duplicate your data into a new table and build a PRIMARY KEY to support that query pattern. This is called "query-based modeling." I would build a query table like this:

CREATE TABLE newdata_by_deviceid_and_time (
  timestamp timestamp,
  deviceid int,
  tagid int,
  decvalue decimal,
  alphavalue text,
  PRIMARY KEY (deviceid,timestamp));

That table can support a range query on timestamp, while partitioning on deviceid.
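Against that table, the last query pattern (known deviceid with a time range) becomes a simple slice of a single partition (literal values are placeholders):

SELECT * FROM newdata_by_deviceid_and_time
WHERE deviceid = 1042
AND timestamp >= '2016-03-01 00:00:00'
AND timestamp < '2016-03-16 00:00:00';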

But the biggest problem I see with either of these models, is that of "unbounded row growth." Basically, as you collect more and more values for your devices, you will approach the 2 billion cell limit per partition (and again, things will probably get slow way before that). What you need to do, is use a modeling technique called "time bucketing."

For this example, I'll say that I determined that bucketing by month would keep me well under the 2 billion cell limit and allow for the kind of date-range flexibility I needed. If so, I would add an additional partition key, monthbucket, and my (new) table would look like this:

CREATE TABLE newdata_by_deviceid_and_time (
  timestamp timestamp,
  deviceid int,
  tagid int,
  decvalue decimal,
  alphavalue text,
  monthbucket text,
  PRIMARY KEY ((deviceid,monthbucket),timestamp));

Now when I wanted to query for data in a specific device and date range, I would also specify the monthbucket:

SELECT * FROM newdata_by_deviceid_and_time
WHERE deviceid = 1042 AND monthbucket = '201603'
AND timestamp >= '2016-03-01 00:00:00-0500'
AND timestamp < '2016-03-16 00:00:00-0500';

Remember, monthbucket is just an example. For you, it may make more sense to use quarter or even year (assuming that you don't store too many values per deviceid in a year).
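One thing to keep in mind with bucketing is that Cassandra won't compute monthbucket for you; your application has to derive it from the timestamp (e.g., "201603" for March 2016) and supply it on every write. A sketch of such an INSERT, with placeholder values:

INSERT INTO newdata_by_deviceid_and_time
  (deviceid, monthbucket, timestamp, tagid, decvalue, alphavalue)
VALUES (1042, '201603', '2016-03-16 10:00:00', 7, 21.4, 'normal');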

Aaron answered Sep 28 '22 at 05:09