Order latest records by timestamp in Cassandra

I'm trying to display the latest values from a list of sensors. The list should also be sortable by the time-stamp.

I tried two different approaches. I included the update time of the sensor in the primary key:

CREATE TABLE sensors (
    customerid int,
    sensorid int,
    changedate timestamp,
    value text,
    PRIMARY KEY (customerid, changedate)
) WITH CLUSTERING ORDER BY (changedate DESC);

Then I can select the list like this:

select * from sensors where customerid=0 order by changedate desc;

which results in this:

 customerid | changedate               | sensorid | value
------------+--------------------------+----------+-------
          0 | 2015-07-10 12:46:53+0000 |        1 |     2
          0 | 2015-07-10 12:46:52+0000 |        1 |     1
          0 | 2015-07-10 12:46:52+0000 |        0 |     2
          0 | 2015-07-10 12:46:26+0000 |        0 |     1

The problem is, I don't get only the latest results, but all the old values too.

If I remove the changedate from the primary key, the select fails altogether:

InvalidRequest: code=2200 [Invalid query] message="Order by is currently only supported on the clustered columns of the PRIMARY KEY, got changedate"

Updating the sensor values is also not an option:

update overview set changedate=unixTimestampOf(now()), value = '5' where customerid=0 and sensorid=0;
InvalidRequest: code=2200 [Invalid query] message="PRIMARY KEY part changedate found in SET part"

This fails because changedate is part of the primary key.

Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?

Edit: In the meantime I tried another approach: storing only the latest value.

I used this schema:

CREATE TABLE sensors (
    customerid int,
    sensorid int,
    changedate timestamp,
    value text,
    PRIMARY KEY (customerid, sensorid, changedate)
) WITH CLUSTERING ORDER BY (sensorid ASC, changedate DESC);

Before inserting the latest value, I would delete all old values:

DELETE FROM sensors WHERE customerid=? and sensorid=?;

But this fails because changedate is NOT part of the WHERE clause.

asked Jul 10 '15 at 13:07 by user5102859



2 Answers

The problem is, I don't get only the latest results, but all the old values too.

Since the table uses CLUSTERING ORDER BY (changedate DESC), the newest rows come first within each partition, so getting the latest records is easy. Just add LIMIT to your query, e.g.:

select * from sensors where customerid=0 order by changedate desc limit 10;

This returns at most 10 records, those with the highest changedate. Even with LIMIT you are still guaranteed to get the latest records, because the data is stored in that order.

If I remove the changedate from the primary key, the select fails altogether.

This is because ORDER BY only works on clustering columns (the part of the primary key that follows the partition key). A secondary index might appear to be a way around this, but I would not recommend it.

Updating the sensor values is also no option

Your update query fails because a primary key column cannot appear in the SET clause. Move changedate into the WHERE clause instead, and separate the SET assignments with a comma rather than 'and'. Because changedate is part of the key, this writes a new row rather than changing an existing one:

update overview set value = '5', sensorid = 0 where customerid=0 and changedate=unixTimestampOf(now());

Is there any possible way to store only the latest values from each sensor and also keep the table ordered by the time-stamp?

You can do this by creating a separate table named 'latest_sensor_data' with the same definition except for the primary key. The primary key becomes 'customerid, sensorid', so there can be only one record per sensor. Creating separate tables like this is called denormalization, and it is a common pattern in Cassandra data modeling. When you insert sensor data, you would now insert it into both 'sensors' and 'latest_sensor_data'.

CREATE TABLE latest_sensor_data (
  customerid int,
  sensorid int,
  changedate timestamp,
  value text,
  PRIMARY KEY (customerid, sensorid)
);

Cassandra 3.0 will introduce 'materialized views', which should make this unnecessary, since a materialized view can maintain the second table for you.

Now doing the following query:

select * from latest_sensor_data where customerid=0

Will give you the latest value for every sensor for that customer.
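
One possible way to keep the two tables in step is to write both rows in a single logged batch. This is only a sketch: it assumes the history table is still named 'sensors' and reuses sample values from the question's output.

```cql
-- Write the same reading to the history table and to the latest-value table.
BEGIN BATCH
  INSERT INTO sensors (customerid, changedate, sensorid, value)
  VALUES (0, '2015-07-10 12:46:53+0000', 1, '2');

  -- Same (customerid, sensorid) key as any earlier reading for this sensor,
  -- so this insert overwrites the previous "latest" row.
  INSERT INTO latest_sensor_data (customerid, sensorid, changedate, value)
  VALUES (0, 1, '2015-07-10 12:46:53+0000', '2');
APPLY BATCH;
```

Note that latest_sensor_data is clustered by sensorid, so its rows come back ordered by sensor. Any sorting of this table by changedate would have to happen in the client.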

I would recommend renaming 'sensors' to 'sensor_data' or 'sensor_history' to make it more clear what the data is. Additionally you should change the primary key to 'customerid, changedate, sensorid' as that would allow you to have multiple sensors at the same date (which seems possible).
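
As a sketch of that suggestion, the history table could look like the following. The name 'sensor_history' is just one of the options mentioned above, and the explicit order on sensorid is an assumption.

```cql
-- History table: one row per customer, timestamp and sensor.
CREATE TABLE sensor_history (
    customerid int,
    changedate timestamp,
    sensorid int,
    value text,
    PRIMARY KEY (customerid, changedate, sensorid)
) WITH CLUSTERING ORDER BY (changedate DESC, sensorid ASC);
```

With sensorid in the key, two sensors reporting at the same changedate no longer overwrite each other.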

answered Oct 21 '22 at 04:10 by Andy Tolbert


Your first approach looks reasonable. If you add "limit 1" to your query, you would only get the latest result, or limit 2 to see the latest 2 results, etc.

If you want to automatically remove old values from the table, you can specify a TTL (Time To Live) for data points when you do the insert. So if you wanted to keep data points for 10 days, you could do this by adding "USING TTL 864000" on your insert statements. Or you could set a default TTL for the entire table.
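
Both options might look like this. It is a sketch against the 'sensors' table from the question, with sample values taken from its output.

```cql
-- Per-write TTL: this row expires 864000 seconds (10 days) after the insert.
INSERT INTO sensors (customerid, changedate, sensorid, value)
VALUES (0, '2015-07-10 12:46:53+0000', 1, '2')
USING TTL 864000;

-- Table-wide default, used by writes that do not set their own TTL.
ALTER TABLE sensors WITH default_time_to_live = 864000;
```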

answered Oct 21 '22 at 04:10 by Jim Meyer