How would I write the CQL to get the most recent set of data from each row?
I'm investigating transitioning from MSSQL to Cassandra and am starting to grasp the concepts. Lots of research has help tremendously, but I haven't found answer to this (I know there must be a way):
CREATE TABLE WideData {
ID text,
Updated timestamp,
Title text,
ReportData text,
PRIMARY KEY (ID, Updated)
} WITH CLUSTERING ORDER (Updated DESC)
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title', 'Blah blah blah blah')
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', NOW, 'Title', 'Blah blah blah blah')
wait 1 minute:
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('bbb', NOW, 'Title 2', 'Blah blah blah blah')
wait 3 minutes:
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title 2', 'Blah blah blah blah')
wait 5 minutes:
INSERT INTO WideData (ID, Updated, Title, ReportData) VALUES ('aaa', NOW, 'Title 3', 'Blah blah blah blah')
How would I write the CQL to get the most recent set of data from each row?
SELECT ID, Title FROM WideRow - gives me 5 rows, as it pivots the data for me.
Essentially I want the results for (SELECT ID, Title FROM WideRow WHERE .....) to be:
ID Title
aaa, Title3
bbb, Title2
Also, is there a way to get a count of the number of data sets in a wide row?
Essentially the equivalent of TSQL: SELECT ID, Count(*) FROM Table GROUP BY ID
ID Count
aaa 3
bbb 2
Thanks
Also, any references to learn more about these types of queries would also be appreciated.
With your current data model, you can only query the most-recent row by partition key. In your case, that is ID
.
SELECT ID, Title FROM WideData WHERE ID='aaa' LIMIT 1
Since you have indicated your clustering order on Updated
in DESCending order, the row with the most-recent Updated
timestamp will be returned first.
Given your desired results, I'll go ahead and assume that you do not want to query each partition key individually. Cassandra only maintains CQL result set order by partition key. Also Cassandra does not support aggregation. So there really is no way to get the "most recent" for all of your ID
s together at once, nor is there a way to get a report of how many updates each ID
has.
With Cassandra data modeling, you need to build your tables to suit your queries. Query "planning" is not really a strong point of Cassandra (as you are finding out). To get the most-recent updates by ID
, you would need to build an additional query table designed to store only the most-recent update for each ID. Likewise, to get the count of updates for each ID
you could create an additonal query table using counter coulmns to suit that query.
tl;dr
In Cassandra, denormalization and redundant data storage is the key. For some applications, you might have one table for each query you need to support...and that's ok.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With