Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra Allow filtering

I have a table as below

CREATE TABLE test (
 day int,
 id varchar,  
 start int,
 action varchar,  
 PRIMARY KEY((day),start,id)
);

I want to run this query

Select * from test where day=1 and start > 1475485412 and start < 1485785654 
and action='accept' ALLOW FILTERING

Is this ALLOW FILTERING efficient?

I am expecting that cassandra will filter in this order

1. By Partitioning column(day)
2. By the range column(start) on the 1's result
3. By action column on 2's result. 

So the allow filtering will not be a bad choice on this query.

In case of the multiple filtering parameters on the where clause and the non indexed column is the last one, how will the filter work? Please explain.

like image 214
Bharathi Avatar asked Mar 03 '17 10:03

Bharathi


People also ask

What is allow filtering in Cassandra?

Cassandra will request ALLOW FILTERING as it will have to first find and load the rows containing Jonathan as author, and then to filter out the ones which do not have a time2 column equal to the specified value. Adding an index on time2 might improve the query performance.

Does Cassandra support sorting?

Cassandra supports sorting using the clustering columns. When you create a table, you can define clustering columns which will be used to sort the data inside each partition in either ascending or descending orders. Then you can easily use the ORDER BY clause with the ASC or DESC options.

When should you not use Cassandra?

When you want many-to-many mappings or join tables. Cassandra doesn't support a relational schema with foreign keys and join tables. So if you want to write a lot of complex join queries, then Cassandra might not be the right database for you.

How do I limit the number of rows in Cassandra?

The LIMIT option sets the maximum number of rows that the query returns: SELECT lastname FROM cycling. cyclist_name LIMIT 50000; Even if the query matches 105,291 rows, Cassandra only returns the first 50,000.


2 Answers

Is this ALLOW FILTERING efficient?

When you write "this" you mean in the context of your query and your model, however the efficiency of an ALLOW FILTERING query depends mostly on the data it has to filter. Unless you show some real data this is a hard to answer question.

I am expecting that cassandra will filter in this order...

Yeah, this is what will happen. However, the inclusion of an ALLOW FILTERING clause in the query usually means a poor table design, that is you're not following some guidelines on Cassandra modeling (specifically the "one query <--> one table").

As a solution, I could hint you to include the action field in the clustering key just before the start field, modifying your table definition:

CREATE TABLE test (
 day int,
 id varchar,  
 start int,
 action varchar,  
 PRIMARY KEY((day),action,start,id)
);

You then would rewrite your query without any ALLOW FILTERING clause:

SELECT * FROM test WHERE day=1 AND action='accept' AND start > 1475485412 AND start < 1485785654

having only the minor issue that if one record "switches" action values you cannot perform an update on the single action field (because it's now part of the clustering key), so you need to perform a delete with the old action value and an insert it with the correct new value. But if you have Cassandra 3.0+ all this can be done with the help of the new Materialized View implementation. Have a look at the documentation for further information.

like image 63
xmas79 Avatar answered Sep 20 '22 19:09

xmas79


In general ALLOW FILTERING is not efficient.

But in the end it depends on the size of the data you are fetching (for which cassandra have to use ALLOW FILTERING) and the size of data its being fetched from.

In your case cassandra do not need filtering upto :

  1. By the range column(start) on the 1's result

As you mentioned. But after that, it will rely on filtering to search data, which you are allowing in query itself.

Now, keep following in mind

If your table contains for example a 1 million rows and 95% of them have the requested value, the query will still be relatively efficient and you should use ALLOW FILTERING.

On the other hand, if your table contains 1 million rows and only 2 rows contain the requested value, your query is extremely inefficient. Cassandra will load 999, 998 rows for nothing. If the query is often used, it is probably better to add an index on the time1 column.

So ensure this first. If it works in you favour, use FILTERING. Otherwise, it would be wise to add secondary index on 'action'.

PS : There is some minor edit.

like image 39
r005t3r Avatar answered Sep 19 '22 19:09

r005t3r