
CQL with a wide row - how to get most recent set?

How would I write the CQL to get the most recent set of data from each row?

I'm investigating a transition from MSSQL to Cassandra and am starting to grasp the concepts. Lots of research has helped tremendously, but I haven't found an answer to this (I know there must be a way):

CREATE TABLE WideData (
 ID text,
 Updated timestamp,
 Title text,
 ReportData text,
 PRIMARY KEY (ID, Updated)
) WITH CLUSTERING ORDER BY (Updated DESC);

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', toTimestamp(now()), 'Title', 'Blah blah blah blah');
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', toTimestamp(now()), 'Title', 'Blah blah blah blah');

wait 1 minute:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', toTimestamp(now()), 'Title 2', 'Blah blah blah blah');

wait 3 minutes:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', toTimestamp(now()), 'Title 2', 'Blah blah blah blah');

wait 5 minutes:

INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', toTimestamp(now()), 'Title 3', 'Blah blah blah blah');

How would I write the CQL to get the most recent set of data from each row?

SELECT ID, Title FROM WideData - gives me 5 rows, as it pivots the data for me.

Essentially I want the results for (SELECT ID, Title FROM WideData WHERE .....) to be:

ID   Title
aaa  Title 3
bbb  Title 2

Also, is there a way to get a count of the number of data sets in a wide row?

Essentially the equivalent of TSQL: SELECT ID, Count(*) FROM Table GROUP BY ID

ID   Count
aaa  3
bbb  2

Thanks

Also, any references to learn more about these types of queries would also be appreciated.

Carol AndorMarten Liebster asked Sep 28 '22 12:09


1 Answer

With your current data model, you can only query the most-recent row by partition key. In your case, that is ID.

SELECT ID, Title FROM WideData WHERE ID='aaa' LIMIT 1

Since you have indicated your clustering order on Updated in DESCending order, the row with the most-recent Updated timestamp will be returned first.

Given your desired results, I'll go ahead and assume that you do not want to query each partition key individually. Cassandra only maintains CQL result set order within a partition key; rows from different partitions come back in no meaningful order. Also, Cassandra does not support aggregation. So there really is no way to get the "most recent" for all of your IDs together at once, nor is there a way to get a report of how many updates each ID has.
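One caveat on the paragraph above: it describes older Cassandra releases. Assuming a newer cluster (as far as I know, PER PARTITION LIMIT arrived in 3.6 and GROUP BY on the partition key in 3.10), something like the following sketch may cover both reports directly against WideData. Treat the version numbers as an assumption to check against your own release notes.

```cql
-- Assumes Cassandra 3.6+: return only the first row of each partition.
-- Because WideData clusters on Updated DESC, the first row is the most recent one.
SELECT ID, Title FROM WideData PER PARTITION LIMIT 1;

-- Assumes Cassandra 3.10+: count the rows in each partition.
SELECT ID, COUNT(*) FROM WideData GROUP BY ID;
```

Both statements still read every partition in the table, so they are better suited to occasional reports than to queries on a hot path.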

With Cassandra data modeling, you need to build your tables to suit your queries. Query "planning" is not really a strong point of Cassandra (as you are finding out). To get the most-recent updates by ID, you would need to build an additional query table designed to store only the most-recent update for each ID. Likewise, to get the count of updates for each ID, you could create an additional query table using counter columns to suit that query.
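As a rough sketch of those two query tables: the names LatestData, UpdateCounts and UpdateCount are made up for illustration and are not part of the original schema, and toTimestamp(now()) assumes a release that provides that function. The application would write to these tables alongside WideData on every update.

```cql
-- Hypothetical query table: one row per ID, overwritten on every update.
CREATE TABLE LatestData (
  ID text PRIMARY KEY,
  Updated timestamp,
  Title text,
  ReportData text
);

-- Hypothetical counter table: one counter per ID.
CREATE TABLE UpdateCounts (
  ID text PRIMARY KEY,
  UpdateCount counter
);

-- On each update, in addition to the insert into WideData:
-- an INSERT on an existing primary key overwrites the previous row (upsert).
INSERT INTO LatestData (ID, Updated, Title, ReportData)
VALUES ('aaa', toTimestamp(now()), 'Title 3', 'Blah blah blah blah');

-- Counter columns can only be changed with UPDATE.
UPDATE UpdateCounts SET UpdateCount = UpdateCount + 1 WHERE ID = 'aaa';

-- The two reports then become plain reads:
SELECT ID, Title FROM LatestData;
SELECT ID, UpdateCount FROM UpdateCounts;
```

Two things to keep in mind with this layout: counter increments are not idempotent, so a retried write can over-count, and the unrestricted SELECTs still scan every partition, which is fine for a small table but worth watching as the number of IDs grows.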

tl;dr

In Cassandra, denormalization and redundant data storage are key. For some applications, you might have one table for each query you need to support...and that's ok.

Aaron answered Oct 06 '22 01:10