Let's say that I have the following table in Cassandra:
CREATE TABLE customer_bought_product (
    store_id uuid,
    product_id text,
    order_time timestamp,
    email text,
    first_name text,
    last_name text,
    PRIMARY KEY ((store_id, product_id), order_time, email)
);
The partition key is composed of store_id and product_id, and the table is used to store time series data.
The data does not have a TTL, as it should be accessible at all times.
In some cases we may need to delete all of the data for a given store_id.
What is the best practice to do it?
So far I have thought of the following solutions:
[...] store_id. - The downside is that this will take more and more time as we insert more data in the table.
[...] store_id, get the keys from it and create a delete statement for each of those keys. - I do not like this concept, because I have to maintain the records.
Has anyone encountered this problem? What is the best practice to clear unused records from Cassandra (excluding TTL)?
You simply cannot alter the primary key of a Cassandra table. You need to create another table with your new schema and perform a data migration.
A primary key in Cassandra represents both a unique data partition and a data arrangement inside a partition. Data arrangement information is provided by optional clustering columns. Each unique partition key represents a set of table rows managed in a server, as well as all servers that manage its replicas.
The partition key is responsible for distributing data among nodes. A partition key is the same as the primary key when the primary key consists of a single column. Cassandra hashes each partition key to a token, and every partition belongs to the node that owns that token. Cassandra is organized into a cluster of nodes, with each node responsible for a roughly equal share of the partition key hashes.
The best performance from Cassandra can be realized when the data needed to satisfy a particular query is located in a single partition. Cassandra stores all rows that share a partition key together on the same node (and its replicas). So, to spread the data over multiple nodes, you define a composite partition key.
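To make the point about composite partition keys concrete, here is a toy sketch of how a (store_id, product_id) pair could map to a node. It is only an illustration: Cassandra's default partitioner uses Murmur3 and a real token ring, whereas this sketch substitutes MD5 and equal-sized ranges, and the names toy_token and toy_owner are made up for the example. The takeaway is that the hash covers the whole composite key, so the partitions of one store are scattered across nodes and cannot be located from store_id alone.

```python
import hashlib
import uuid


def toy_token(store_id: uuid.UUID, product_id: str) -> int:
    """Hash the full composite partition key to a 64-bit token.

    MD5 is a stand-in here; it is NOT the hash Cassandra actually uses.
    """
    key = store_id.bytes + b"\x00" + product_id.encode("utf-8")
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")


def toy_owner(token: int, node_count: int) -> int:
    """Map a token in [0, 2**64) to one of node_count equal-sized ranges."""
    return (token * node_count) >> 64


store = uuid.UUID("00000000-0000-0000-0000-000000000001")
for product in ["p1", "p2", "p3"]:
    # Same store_id, different product_id: each pair is its own partition.
    print(product, "-> node", toy_owner(toy_token(store, product), node_count=4))
```

This is why a query such as DELETE ... WHERE store_id = ? is rejected on the table above: without product_id the partition key is incomplete and there is no token to route the request to.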
Cassandra deletes data in each selected partition atomically and in isolation. Deleted data is not removed from disk immediately. Cassandra marks the deleted data with a tombstone and then removes it after the grace period.
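As a rough illustration of the grace period, the sketch below computes the earliest moment a tombstone could be dropped. It assumes the table uses the default gc_grace_seconds of 864000 (10 days); the function name earliest_purge_time is made up for the example. Actual removal only happens when a later compaction processes the data, so this is a lower bound, not a schedule.

```python
from datetime import datetime, timedelta, timezone

# Cassandra's default gc_grace_seconds (10 days); a table may override it.
DEFAULT_GC_GRACE_SECONDS = 864_000


def earliest_purge_time(
    deleted_at: datetime,
    gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS,
) -> datetime:
    """Return the earliest time a tombstone written at deleted_at may be purged."""
    return deleted_at + timedelta(seconds=gc_grace_seconds)


deleted_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(earliest_purge_time(deleted_at))  # 2024-01-11 00:00:00+00:00
```

Until that point the deleted rows still occupy disk space, and reads that touch them have to skip over the tombstones.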
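Clustering keys are the columns we add to the primary key after the partition key. They give the order of the rows within a partition. In the table above, rows in each (store_id, product_id) partition are sorted by order_time and then by email. Partition key and clustering key together make up the primary key, and that is a key part of table design in Cassandra. When the primary key is declared as a plain list of columns, Cassandra uses the first column as the partition key; an inner pair of parentheses, as in ((store_id, product_id), order_time, email), declares a composite partition key.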
CAUTION: Using delete may impact performance. DELETE [column_name (term)] [, ...]
Create a materialized view that stores the product_ids belonging to each store_id. You can then query the view for a given store_id and delete the corresponding partitions from the main table. This avoids the additional application code that would otherwise be needed to maintain two separate tables.
create materialized view mv_customer_bought_product
as select product_id, store_id, order_time, email
from customer_bought_product
where order_time is not null
and email is not null
and product_id is not null
and store_id is not null
primary key (store_id, product_id, order_time, email) ;
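The delete step could then look roughly like the sketch below. It only builds the statements and does not talk to a cluster: the helper name partition_deletes is made up, and the %s placeholders assume a client such as the DataStax Python driver, whose session.execute accepts a query string plus a tuple of parameters. Because the view holds one row per order, the same product_id comes back many times, so the sketch de-duplicates on the client before issuing one partition-level DELETE per (store_id, product_id) pair.

```python
import uuid
from typing import Iterable, Iterator, Tuple

# Query to run against the view for one store (returns one row per order).
SELECT_PRODUCTS = (
    "SELECT product_id FROM mv_customer_bought_product WHERE store_id = %s"
)

# Deleting by the full partition key removes the whole partition at once.
DELETE_PARTITION = (
    "DELETE FROM customer_bought_product "
    "WHERE store_id = %s AND product_id = %s"
)


def partition_deletes(
    store_id: uuid.UUID, product_ids: Iterable[str]
) -> Iterator[Tuple[str, tuple]]:
    """Yield one (statement, parameters) pair per distinct product_id."""
    seen = set()
    for product_id in product_ids:
        if product_id in seen:
            continue  # the view repeats product_id for every order
        seen.add(product_id)
        yield DELETE_PARTITION, (store_id, product_id)


# Example with product_ids as they might come back from SELECT_PRODUCTS.
store = uuid.UUID("00000000-0000-0000-0000-000000000001")
for statement, params in partition_deletes(store, ["p1", "p1", "p2"]):
    print(statement, params)
```

Each resulting statement specifies the complete partition key, so Cassandra can write a single partition tombstone instead of one tombstone per row.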