<pre class="prettyprint"><code>CREATE TABLE users ( userID uuid, firstname text, lastname text, state text, zip int, age int, PRIMARY KEY (userID) ); </code></pre> I want to construct the following queries: <pre class="prettyprint"><code>select * from users where age between 30 and 40 select * from users where state in "AZ" AND "WA" </code></pre> I know I need two more tables to do this query but I dont know how the should be? EDIT From Carlo's comments, I see this is the only possibility <pre class="prettyprint"><code>CREATE TABLE users ( userID uuid, firstname text, lastname text, state text, zip int, age int, PRIMARY KEY (age,zip,userID) ); </code></pre> Now to select Users with age between 15 and 30. this is the only possibility: <pre class="prettyprint"><code>select * from users where age IN (15,16,17,....30) </code></pre> However, using IN operator here is not recommended and is anti-pattern. How about creating secondary Index on age? <pre class="prettyprint"><code>CREATE index users_age ON users(age) </code></pre> will this help? Thanks

Range queries is a prikly question. The way to perform a real range query is to use a compound primary key, making the range on the clustering part. Since the range is on clustering part you can't perform the queries you wrote: you need at least to have an equal condition on the whole partition key. Let's see an example: <pre class="prettyprint"><code>CREATE TABLE users ( mainland text, state text, uid int, name text, zip int, PRIMARY KEY ((mainland), state, uid) ) </code></pre> The uid is now an int just to make tests easier <pre class="prettyprint"><code>insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'washington', 1, 'john', 98100); insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75000); insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19904); insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 19910); insert into users (mainland, state, uid, name, zip) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 20150); insert into users (mainland, state, uid, name, zip) VALUES ( 'southamerica', 'argentina', 6, 'alex', 10840); </code></pre> Now the query can perform what you need: <pre class="prettyprint"><code> select * from users where mainland = 'northamerica' and state > 'ca' and state < 'ny'; </code></pre> Output <pre class="prettyprint"><code> mainland | state | uid | name | zip -------------+----------+-----+--------+------- northamerica | delaware | 3 | henry | 19904 northamerica | delaware | 4 | dawson | 19910 </code></pre> if you put an int (age, zipcode) as first column of the clustering key you can perform the same queries comparing integers. TAKE CARE: most of people when looking at this situation starts thinking "ok, I can put a fake partition key that is always the same and then I can perform range queries". This is a huge error, the partition key is responsible for data distribution accross nodes. Setting a fix partition key means that all data will finish in the same node (and in its replica). Dividing the world zone into 15/20 zones (in order to have 15/20 partition key) is something but is not enough and is made just to create a valid example. EDIT: due to question's edit I did not say that this is the only possibility; if you can't find a valid way to partition your users and need to perform this kind of query this is one possibility, not the only one. Range queries should be performed on clustering key portion. A weak point of the AGE as partition key is that you can't perform an UPDATE over it, anytime you need to update the user's age you have to perform a delete and an insert (an alternative could be writing the birth_year/birth_date and not the age, and then calculate client side) To answer your question on adding a secondary index: actually queries on secondary index does not support IN operator. From the CQL message it looks like they're going to develop it soon <blockquote> Bad Request: IN predicates on non-primary-key columns (xxx) is not yet supported </blockquote> However even if secondary index would support IN operator your query wouldn't change from <pre class="prettyprint"><code>select * from users where age IN (15,16,17,....30) </code></pre> Just to clarify my concept: anything that does not have a "clean" and "ready" solution requires the effort of the user to model data in a way that satisfy its needs. To make an example (I don't say this is a good solution: I would not use it) <pre class="prettyprint"><code>CREATE TABLE users ( years_range text, age int, uid int, PRIMARY KEY ((years_range), age, uid) ) </code></pre> put some data <pre class="prettyprint"><code>insert into users (years_range, age , uid) VALUES ( '11_15', 14, 1); insert into users (years_range, age , uid) VALUES ( '26_30', 28, 3); insert into users (years_range, age , uid) VALUES ( '16_20', 16, 2); insert into users (years_range, age , uid) VALUES ( '26_30', 29, 4); insert into users (years_range, age , uid) VALUES ( '41_45', 41, 5); insert into users (years_range, age , uid) VALUES ( '21_25', 23, 5); </code></pre> query data <pre class="prettyprint"><code>select * from users where years_range in('11_15', '16_20', '21_25', '26_30') and age > 14 and age < 29; </code></pre> output <pre class="prettyprint"><code> years_range | age | uid -------------+-----+----- 16_20 | 16 | 2 21_25 | 23 | 5 26_30 | 28 | 3 </code></pre> This solution might solve your problem and could be used in a small cluster, where about 20 keys (0_5 ...106_110) might have a good distribution. But this solution, like the one before, does not allow an UPDATE and reduces the distribution of key. The advantage is that you have small IN sets. In a perfect world where S.I. already allows IN clause I'd use the UUID as partition key, the years_range (set as birth_year_range) as S.I. and "filter" my data client side (if interested in 10 > age > 22 I would ask for <code>IN('1991_1995', '1996_2000', '2001_2005', '2006_2010', '2011_2015')</code> calculating and removing unuseful years on my application) HTH, Carlo

how to construct range query in cassandra?

Tags:

cql

cassandra-2.0

cql3

CREATE TABLE users ( 
userID uuid, 
firstname text, 
lastname text, 
state text, 
zip int,
age int,
PRIMARY KEY (userID) 
);

I want to construct the following queries:

select * from users where age between 30 and 40

select * from users where state in "AZ" AND "WA"

I know I need two more tables to do this query but I dont know how the should be?

EDIT

From Carlo's comments, I see this is the only possibility

CREATE TABLE users ( 
userID uuid, 
firstname text, 
lastname text, 
state text, 
zip int,
age int,
PRIMARY KEY (age,zip,userID) 
);

Now to select Users with age between 15 and 30. this is the only possibility:

select * from users where age IN (15,16,17,....30)

However, using IN operator here is not recommended and is anti-pattern.

How about creating secondary Index on age?

CREATE index users_age ON users(age)

will this help?

Thanks

955

asked Jul 26 '14 07:07

brain storm

2 Answers

Range queries is a prikly question. The way to perform a real range query is to use a compound primary key, making the range on the clustering part. Since the range is on clustering part you can't perform the queries you wrote: you need at least to have an equal condition on the whole partition key. Let's see an example:

CREATE TABLE users (
  mainland text,
  state text,
  uid int,
  name text,
  zip int,
  PRIMARY KEY ((mainland), state, uid)
)

The uid is now an int just to make tests easier

insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'washington', 1, 'john', 98100);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'texas', 2, 'lukas', 75000);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 3, 'henry', 19904);
insert into users (mainland, state, uid, name, zip) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 19910);
insert into users (mainland, state, uid, name, zip) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 20150);
insert into users (mainland, state, uid, name, zip) VALUES ( 'southamerica', 'argentina', 6, 'alex', 10840);

Now the query can perform what you need:

 select * from users where mainland = 'northamerica' and state > 'ca' and state < 'ny';

Output

 mainland    | state    | uid | name   | zip
-------------+----------+-----+--------+-------
northamerica | delaware |   3 |  henry | 19904
northamerica | delaware |   4 | dawson | 19910

if you put an int (age, zipcode) as first column of the clustering key you can perform the same queries comparing integers.

TAKE CARE: most of people when looking at this situation starts thinking "ok, I can put a fake partition key that is always the same and then I can perform range queries". This is a huge error, the partition key is responsible for data distribution accross nodes. Setting a fix partition key means that all data will finish in the same node (and in its replica).

Dividing the world zone into 15/20 zones (in order to have 15/20 partition key) is something but is not enough and is made just to create a valid example.

EDIT: due to question's edit

I did not say that this is the only possibility; if you can't find a valid way to partition your users and need to perform this kind of query this is one possibility, not the only one. Range queries should be performed on clustering key portion. A weak point of the AGE as partition key is that you can't perform an UPDATE over it, anytime you need to update the user's age you have to perform a delete and an insert (an alternative could be writing the birth_year/birth_date and not the age, and then calculate client side)

To answer your question on adding a secondary index: actually queries on secondary index does not support IN operator. From the CQL message it looks like they're going to develop it soon

Bad Request: IN predicates on non-primary-key columns (xxx) is not yet supported

However even if secondary index would support IN operator your query wouldn't change from

select * from users where age IN (15,16,17,....30)

Just to clarify my concept: anything that does not have a "clean" and "ready" solution requires the effort of the user to model data in a way that satisfy its needs. To make an example (I don't say this is a good solution: I would not use it)

CREATE TABLE users (
  years_range text,
  age int,
  uid int,
  PRIMARY KEY ((years_range), age, uid)
)

put some data

insert into users (years_range, age , uid) VALUES ( '11_15', 14, 1);
insert into users (years_range, age , uid) VALUES ( '26_30', 28, 3);
insert into users (years_range, age , uid) VALUES ( '16_20', 16, 2);
insert into users (years_range, age , uid) VALUES ( '26_30', 29, 4);
insert into users (years_range, age , uid) VALUES ( '41_45', 41, 5);
insert into users (years_range, age , uid) VALUES ( '21_25', 23, 5);

query data

select * from users where years_range in('11_15', '16_20', '21_25', '26_30') and age > 14 and age < 29;

output

 years_range | age | uid
-------------+-----+-----
       16_20 |  16 |   2
       21_25 |  23 |   5
       26_30 |  28 |   3

This solution might solve your problem and could be used in a small cluster, where about 20 keys (0_5 ...106_110) might have a good distribution. But this solution, like the one before, does not allow an UPDATE and reduces the distribution of key. The advantage is that you have small IN sets.

In a perfect world where S.I. already allows IN clause I'd use the UUID as partition key, the years_range (set as birth_year_range) as S.I. and "filter" my data client side (if interested in 10 > age > 22 I would ask for IN('1991_1995', '1996_2000', '2001_2005', '2006_2010', '2011_2015') calculating and removing unuseful years on my application)

HTH, Carlo

163

answered Sep 18 '22 22:09

Carlo Bertuccini

I found that using allow filtering, I can query for range: example is here:

 CREATE TABLE users2 (
  mainland text,
  state text,
  uid int,
  name text,
  age int,
  PRIMARY KEY (uid, age, state)
) ;

insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'washington', 1, 'john', 81);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'texas', 1, 'lukas', 75);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 1, 'henry', 19);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'northamerica', 'delaware', 4, 'dawson', 90);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'centraleurope', 'italy', 5, 'fabio', 50);
insert into users2 (mainland, state, uid, name, age) VALUES ( 'southamerica', 'argentina', 6, 'alex', 40);

select * from users2 where age>50 and age<=100 allow filtering;

    uid | age | state      | mainland     | name
-----+-----+------------+--------------+--------
   1 |  75 |      texas | northamerica |  lukas
   1 |  81 | washington | northamerica |   john
   2 |  75 |      texas | northamerica |  lukas
   4 |  90 |   delaware | northamerica | dawson

(4 rows)

I am not sure if this performance killer. But this seems to work. Infact, I don't even have to give the primary key which is uid in this case during query execution

answered Sep 17 '22 22:09

brain storm

Related questions
                            
                                Error: unable to connect to cassandra server. Unconfigured table
                            
                                ScyllaDB - [Invalid query] message="Collection filtering is not supported yet"
                            
                                What is the maximum length of data passed to cassandra column
                            
                                Alter cassandra column family primary key using cassandra-cli or CQL
                            
                                How to add multiple columns in cassandra table?
                            
                                What is the byte size of common Cassandra data types - To be used when calculating partition disk usage?
                            
                                How to change CQL version?
                            
                                Cassandra Schema for a Chat Application
                            
                                CQL check if record exists
                            
                                Cassandra 2 - list existing indexes with CQL 3
                            
                                How to create foreign key in Cassandra CQL
                            
                                Cassandra selective copy
                            
                                how UPDATE rows in cassandra using only Partition Key?
                            
                                Passing parameter to Cassandra CQL query using DataStax client
                            
                                RPC timeout in cqlsh - Cassandra
                            
                                Can I create a secondary index on multiple columns in cassandra
                            
                                How to retrieve the timestamp from cassandra?
                            
                                CQL: how to check if keyspace exists?
                            
                                Error creating table in cassandra - Bad Request: Only clustering key columns can be defined in CLUSTERING ORDER directiv
                            
                                Cassandra-CQl : Change Clustering order for Created Column Family

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

how to construct range query in cassandra?

Tags:

cql

cassandra-2.0

cql3

brain storm

People also ask

2 Answers

Carlo Bertuccini

brain storm

Recent Activity

Donate For Us