Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

alllow filtering, data modeling in cql

Tags:

cassandra

I'm currently using and researching about data modeling practices in cassandra. So far, I get that you need have a data modeling based on the queries executed. However, multiple select requirements make data modeling even harder or impossible to handle it on 1 table. So, when you can't handle these requirements on 1 table, you need to insert 2-3 tables. In other words, you need to make multiple inserts on 1 operation.

Currently, I'm dealing with a data model of a campaign structure. I have a campaign table on cassandra with the following cql;

CREATE TABLE campaign_users
(
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid),
    INDEX(campaign_id, created_at)
);

In this model, I need to be able to make incremental exports given a timestamp only. In cassandra, there is allow filtering mode that enables select queries for secondary indexes. So, my cql statement for incremental export is the following;

select campaign_id, uid 
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;

However, if allow filtering is used, there is a warning saying that the statement have unpredictable performance. So, is it a good practice relying on allow filtering ? What can be other alternatives ?

like image 622
aacanakin Avatar asked Sep 09 '13 08:09

aacanakin


People also ask

How do I allow filtering in Cassandra?

SELECT * FROM blogs WHERE author='Jonathan Ellis' and time2 = 1418306451235; Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value.

What is not allowed in CQL query?

CQL does not support wildcard queries. CQL does not support Union, Intersection queries. Table columns cannot be filtered without creating the index. Greater than (>) and less than (<) query is only supported on clustering column.

What is CQL data Modelling?

A Pro Cycling statistics example is used throughout the CQL document. Data modeling is a process that involves identifying the entities (items to be stored) and the relationships between entities. To create your data model, identify the patterns used to access data and the types of queries to be performed.

Why is allow filtering bad in Cassandra?

However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").


1 Answers

The ALLOW FILTERING warning is because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. You could be scanning through all your data to return zero rows, in the worst case. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the data read through scales linearly with the amount of data returned.

This is OK if you're returning most of the data, so the data skipped over doesn't cost very much. But if you were skipping over most of your data a lot of work will be wasted.

The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and duplicate your queries for each day that contains data you need. This method guarantees that most of the data Cassandra reads over is data that you want. The problem is that all data for the bucket (e.g. day) needs to fit in one partition. You can fix this by sharding the partition somehow e.g. include some aspect of the uid within it.

like image 192
Richard Avatar answered Oct 19 '22 20:10

Richard