Get first row for each partition key in Cassandra

Tags:

cassandra

I am considering Cassandra as intermediate storage for data deduplication during my ETL job.

Let's imagine I have a stream of events, each of which has a business entity id, a timestamp, and a value. For each business key I need only the latest value, as determined by the in-event timestamp, but events may arrive out of order.

My idea was to create a staging table with the business id as the partition key and the timestamp as the clustering key:

CREATE TABLE sample_keyspace.table1 (
    id uuid,
    time timestamp,
    value text,
    PRIMARY KEY (id, time)
) WITH CLUSTERING ORDER BY (time DESC);

Now, if I insert some data into this table, I can get the latest value for a given partition key:

select * from table1 where id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d limit 1;
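
For illustration, here is a minimal sketch with made-up sample rows (the timestamps and values are hypothetical, and it assumes table1 has the schema shown above). Two events for the same id are inserted out of order. Because rows within a partition are clustered by time in descending order, the single-partition query still returns the latest one first.

```cql
-- Hypothetical sample data: the newer event is written before the older one.
INSERT INTO table1 (id, time, value)
VALUES (96b29b4b-b60b-4be9-9fa3-efa903511f2d, '2016-11-02 12:00:00+0000', 'newer');
INSERT INTO table1 (id, time, value)
VALUES (96b29b4b-b60b-4be9-9fa3-efa903511f2d, '2016-11-02 11:00:00+0000', 'older');

-- Rows are stored newest-first within the partition,
-- so LIMIT 1 returns the row whose value is 'newer'.
SELECT * FROM table1 WHERE id = 96b29b4b-b60b-4be9-9fa3-efa903511f2d LIMIT 1;
```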

But that would require issuing such a query for every business key I'm interested in.

Is there an efficient way to do this in CQL?

I know it is possible to list all partition keys (with select distinct id from table1). Given Cassandra's storage model, getting the first row of each partition should therefore not be too hard.

Is that supported?

Yura Taras asked Nov 02 '16 12:11


1 Answer

If you're using Cassandra 3.6 or later, there is a query option named PER PARTITION LIMIT (CASSANDRA-7017), which you can set to 1. It won't autocomplete in cqlsh until 3.10 (CASSANDRA-12803).

SELECT * FROM table1 PER PARTITION LIMIT 1;
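
As a hedged sketch, assuming the table1 schema from the question (partition key id, clustering column time in descending order), the same query can name only the columns it needs. Since each partition is stored newest-first, the one row returned per partition is the event with the latest time. The overall cap of 100 in the second query is an arbitrary value chosen for illustration.

```cql
-- Requires Cassandra 3.6 or later.
-- Returns one row per id; with CLUSTERING ORDER BY (time DESC) that row is the latest event.
SELECT id, time, value FROM table1 PER PARTITION LIMIT 1;

-- PER PARTITION LIMIT can be combined with LIMIT,
-- which additionally caps the total number of rows returned.
SELECT id, time, value FROM table1 PER PARTITION LIMIT 1 LIMIT 100;
```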

Chris Lohfink answered Oct 26 '22 10:10