Delete data from Cassandra with part of the partition key

Tags: cassandra, cassandra-3.0

Let's say that I have the following table in Cassandra:

CREATE TABLE customer_bought_product (
    store_id uuid,
    product_id text,
    order_time timestamp,
    email text,
    first_name text,
    last_name text,
    PRIMARY KEY ((store_id, product_id), order_time, email)
);

The partition key is composed of store_id and product_id, and the table is used to store time series data.

The data does not have a TTL, as it should be accessible at all times.

In some cases we may need to delete all of the data for a given store_id. What is the best practice for doing this?

So far I have thought of the following solutions:

1. Write a program that selects all of the data from the table and deletes the records with the given store_id. The downside is that this will take more and more time as we insert more data into the table.
2. Leave the data in the table. The only problem with this is that we will be storing useless data.
3. Store the table name with the available partition keys in a different table that can be queried by store_id, get the keys from it, and create a delete statement for each of those keys. I do not like this concept, because I have to maintain the records.

Has anyone encountered this problem? What is the best practice to clear unused records from Cassandra (excluding TTL)?


asked Jul 03 '17 15:07

Ivan Stoyanov

1 Answer

Create a materialized view that is partitioned by store_id alone, so that it lists the product_ids belonging to each store. You can then query the view for a given store_id and delete the corresponding partitions from the base table, one per (store_id, product_id) pair. Cassandra keeps the view in sync with the base table, so no additional application code is needed to maintain a second table.

CREATE MATERIALIZED VIEW mv_customer_bought_product AS
    SELECT product_id, store_id, order_time, email
    FROM customer_bought_product
    WHERE order_time IS NOT NULL
      AND email IS NOT NULL
      AND product_id IS NOT NULL
      AND store_id IS NOT NULL
    PRIMARY KEY (store_id, product_id, order_time, email);
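
The two steps (look up the product_ids in the view, then delete each base-table partition) could be sketched in Python roughly as follows. This is only an illustration: the function name `delete_store` and its `execute` argument are hypothetical, and `execute` is assumed to behave like `Session.execute` from the DataStax Python driver, taking a CQL string with `%s` placeholders plus a parameter tuple and returning rows with attribute access.

```python
def delete_store(execute, store_id):
    """Delete every customer_bought_product partition for one store.

    `execute` is an assumption: any callable shaped like the DataStax
    Python driver's Session.execute(query, parameters).
    Returns the product_ids whose partitions were deleted.
    """
    # Step 1: the view is partitioned by store_id, so this reads a single
    # view partition instead of scanning the whole base table.
    rows = execute(
        "SELECT product_id FROM mv_customer_bought_product "
        "WHERE store_id = %s",
        (store_id,),
    )

    # The view holds one row per order, so the same product_id repeats.
    # Collect the distinct values fully before issuing any delete, because
    # deletes on the base table also remove rows from the view.
    product_ids = sorted({row.product_id for row in rows})

    # Step 2: delete each base-table partition by its full partition key.
    for product_id in product_ids:
        execute(
            "DELETE FROM customer_bought_product "
            "WHERE store_id = %s AND product_id = %s",
            (store_id, product_id),
        )
    return product_ids
```

With a real cluster this would presumably be called as `delete_store(session.execute, store_id)`, where `session` is a connected driver session. Because each DELETE names the full partition key, it removes a whole partition in one statement rather than row by row.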


answered Oct 30 '22 00:10

dilsingi
