Allow filtering, data modeling in CQL

Q: What is CQL data modeling?

Data modeling is the process of identifying the entities (items to be stored) and the relationships between them. To create your data model, identify the patterns used to access the data and the types of queries to be performed.

Q: Why is ALLOW FILTERING bad in Cassandra?

Including an ALLOW FILTERING clause in a query usually indicates a poor table design, that is, one that does not follow Cassandra's modeling guidelines (specifically "one query <--> one table").

Tags:

cassandra

I'm currently using Cassandra and researching its data modeling practices. So far, I understand that the data model should be based on the queries you execute. However, multiple select requirements make data modeling harder, or even impossible to handle in one table. When one table can't cover these requirements, you need to write to 2-3 tables. In other words, a single operation requires multiple inserts.

Currently, I'm modeling a campaign structure. I have a campaign table in Cassandra with the following CQL:

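CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid)
);

-- CQL has no inline INDEX clause; a secondary index is created separately, on one column.
CREATE INDEX ON campaign_users (created_at);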

In this model, I need to be able to make incremental exports given only a timestamp. Cassandra has an ALLOW FILTERING mode that enables select queries on secondary indexes. So, my CQL statement for the incremental export is the following:

select campaign_id, uid 
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;

However, when ALLOW FILTERING is used, there is a warning saying that the statement may have unpredictable performance. So, is it good practice to rely on ALLOW FILTERING? What are the alternatives?

622

asked Sep 09 '13 08:09

aacanakin

1 Answer

The ALLOW FILTERING warning appears because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. In the worst case, you could be scanning through all your data to return zero rows. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the amount of data read scales linearly with the amount of data returned.

This is OK if you're returning most of the data, since the data skipped over doesn't cost very much. But if you are skipping over most of your data, a lot of work is wasted.
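
To make the contrast concrete, here is a small sketch against the campaign_users table from the question. The campaign id 42 is a made-up value for illustration. The first query names the partition key, so Cassandra can seek directly to one partition; the second has no partition key restriction, so it has to read through rows and discard the ones that don't match.

```cql
-- Restricted by the partition key (campaign_id): reads a single partition,
-- so the work done is proportional to the rows returned.
SELECT uid
FROM campaign_users
WHERE campaign_id = 42;

-- No partition key restriction: Cassandra must scan and skip non-matching
-- rows, which is why it asks for ALLOW FILTERING.
SELECT campaign_id, uid
FROM campaign_users
WHERE created_at > minTimeuuid('2013-08-14 12:26:06+0000')
ALLOW FILTERING;
```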

The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and repeat your query for each day that contains data you need. This method guarantees that most of the data Cassandra reads over is data that you want. The problem is that all data for a bucket (e.g. a day) needs to fit in one partition. You can fix this by sharding the partition somehow, e.g. by including some aspect of the uid in the partition key.
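
A minimal sketch of what such a bucketed table might look like. The table name campaign_users_by_day, the day and shard columns, and the choice of 4 shards are all assumptions made for illustration, not something from the question. The shard value would be computed by the client (for example uid % 4) when writing, and the export would repeat the query for every shard of every day in the range.

```cql
-- Hypothetical export table: one partition per (day, shard) pair.
CREATE TABLE campaign_users_by_day
(
    day text,             -- day bucket, e.g. '2013-08-14'
    shard int,            -- assumed to be uid % 4, computed client-side
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    PRIMARY KEY ((day, shard), created_at, campaign_id, uid)
);

-- First day of the export: start from the timestamp within that day.
-- Run once per shard (0..3); no ALLOW FILTERING is needed because the
-- partition key is fully specified and created_at is a clustering column.
SELECT campaign_id, uid
FROM campaign_users_by_day
WHERE day = '2013-08-14' AND shard = 0
  AND created_at > minTimeuuid('2013-08-14 12:26:06+0000');

-- Each later day: read the whole bucket, again once per shard.
SELECT campaign_id, uid
FROM campaign_users_by_day
WHERE day = '2013-08-15' AND shard = 0;
```

The trade-off is that every write now goes to this table as well as campaign_users, and the client has to enumerate the day and shard combinations itself.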

192

answered Oct 19 '22 20:10

Richard
