
Delete data from Cassandra with part of the partition key

Let's say that I have the following table in Cassandra:

CREATE TABLE customer_bought_product (
    store_id uuid,
    product_id text,
    order_time timestamp,
    email text,
    first_name text,
    last_name text,
    PRIMARY KEY ((store_id, product_id), order_time, email)
);

The partition key is composed of store_id and product_id, and the table is used to store time series data.

The data does not have a TTL, as it should be accessible at all times.

In some cases we may need to delete all of the data for a given store_id. What is the best practice for doing this?

So far I have thought of the following solutions:

  1. Write a program that selects all of the data from the table and deletes the records with the given store_id. The downside is that this will take more and more time as we insert more data into the table.
  2. Leave the data in the table. The only problem with this is that we will accumulate useless data.
  3. Store the table name with the available partition keys in a different table that can be queried by store_id, get the keys from it, and create a delete statement for each of those keys. I do not like this approach, because I have to maintain the records myself.

Has anyone encountered this problem? What is the best practice to clear unused records from Cassandra (excluding TTL)?

Ivan Stoyanov asked Jul 03 '17 15:07



1 Answer

Create a materialized view that stores the product_ids belonging to each store_id. You can then query the view for a given store_id and delete the corresponding rows from the main table. This avoids the additional application code that would otherwise be needed to keep two separate tables in sync.

create materialized view mv_customer_bought_product 
as select product_id, store_id, order_time, email 
from customer_bought_product 
where order_time is not null 
and email is not null 
and product_id is not null 
and store_id is not null 
primary key (store_id, product_id, order_time, email) ;
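
The two-step cleanup described above (look up the product_ids in the view, then delete each partition from the base table) could be sketched in Python roughly as follows. This is a minimal sketch under some assumptions: `delete_store` is a hypothetical helper name, and `execute` is assumed to be any callable that takes a CQL string with `%s` placeholders plus a parameter tuple and returns rows exposing a `product_id` attribute (the DataStax Python driver's `session.execute` behaves this way with its default row factory).

```python
def delete_store(execute, store_id):
    """Delete every base-table partition that belongs to store_id.

    `execute(cql, params)` is an assumed callable, for example
    `session.execute` from the DataStax Python driver.
    Returns the number of partitions for which a DELETE was issued.
    """
    # Step 1: read the product_ids for this store from the materialized view.
    rows = execute(
        "SELECT product_id FROM mv_customer_bought_product WHERE store_id = %s",
        (store_id,),
    )

    # The view returns one row per order, so collapse to distinct product_ids.
    product_ids = {row.product_id for row in rows}

    # Step 2: issue one DELETE per full partition key (store_id, product_id).
    for product_id in sorted(product_ids):
        execute(
            "DELETE FROM customer_bought_product "
            "WHERE store_id = %s AND product_id = %s",
            (store_id, product_id),
        )
    return len(product_ids)
```

Because each DELETE names the full partition key, it removes a whole partition at a time rather than individual rows. For stores with a very large number of orders, paging through the view results instead of building the whole set in memory may be preferable.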
dilsingi answered Oct 30 '22 00:10