Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

CQL SELECT greater-than query on indexed non-key column

EDIT1: added a case to describe the problem after the original question.

I wish to query on a column which is not part of my key. If I understand correctly, I need to define a secondary index on that column. However, I wish to use a greater than condition (not just equality condition) and that still seems unsupported.

Am I missing something? How would you address this issue?

My desired Setup:

Cassandra 1.1.6
CQL3

CREATE TABLE Table1(
             KeyA int,
             KeyB int,
             ValueA int,
             PRIMARY KEY (KeyA, KeyB)
           );

CREATE INDEX ON Table1 (ValueA);

SELECT * FROM Table1 WHERE ValueA > 3000;

Since defining a secondary index on ColumnFamilies with Composite Keys is still not supported in Cassandra 1.1.6 I have to settle on a temporary solution of dropping one of the keys but I still have the same problem with non equality conditions.

Is there another way to address this?

Thank you for your time.

Relevant sources: http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt http://www.datastax.com/docs/1.1/ddl/indexes


EDIT1

Here's a case that will explain the problem. As rs-atl noted, it might be a data model problem. Let's say I keep a column family of all the users on stackoverflow. for each user I keep a batch of stats (Reputation, NumOfAnswers, NumOfVotes... all of them are int). I want to query on those stats to get the relevant users.

CREATE TABLE UserStats(
             UserID int,
             Reputation int,
             NumOfAnswers int,
             .
             .
             .
             A lot of stats...
             .
             .
             .
             NumOfVotes int,
             PRIMARY KEY (UserID)
           );

Now I'm interested in slicing UserID's based on those stats. I want all the users with over 10K reputation, I want all the users with less than 5 answers, etc. etc.

I hope that helps. Thanks again.

like image 705
Oren Avatar asked Nov 27 '12 10:11

Oren


People also ask

What is not allowed in CQL query?

CQL does not support wildcard queries. CQL does not support Union, Intersection queries. Table columns cannot be filtered without creating the index. Greater than (>) and less than (<) query is only supported on clustering column.

How do I select distinct rows in Cassandra?

In cassandra you can only select the distinct records from Partition Key column or columns. If Partition key consists of multiple columns, you have to provide all of the columns otherwise you will get an error.

What is Cqlsh command?

cqlsh is a command-line interface for interacting with Cassandra using CQL (the Cassandra Query Language). It is shipped with every Cassandra package, and can be found in the bin/ directory alongside the cassandra executable.

How do I select a query in Cassandra?

Cassandra provides standard built-in functions that return aggregate values to SELECT statements. A SELECT expression using COUNT(column_name) returns the number of non-NULL values in a column. A SELECT expression using COUNT(*) returns the number of rows that matched the query. Use COUNT(1) to get the same result.


2 Answers

In CQL, you are able to apply the WHERE clause on all columns once you have created indices for them (i.e., secondary index). Otherwise, you will get the following error:

Bad Request: No indexed columns present in by-columns clause with Equal operator

Unfortunately, even with secondary indices, the WHERE clause are required to have at least one EQ on an secondary index by CQL due to performance issue.

Q: Why is it necessary to always have at least one EQ comparison on secondary indices?

A: Inequalities on secondary indices are always done in memory, so without at least one EQ on another secondary index you will be loading every row in the database, which with a massive database isn't a good idea. So by requiring at least one EQ on an (secondary) index, you hopefully limit the set of rows that need to be read into memory to a manageable size. (Although obviously you can still get into trouble with that as well).

So basically if you have anything besides an EQ comparison, it loads all rows "that elsewise match" your query, and checks if they match, one at a time. Which is not allowed by default since it "could be slow." (In essence, indexes only index "for equality" not for anything else like < and > which indexes on a relational database would).

One thing to note is that if you have more than one non EQ conditions on secondary indices, you also need to include the ALLOW FILTERING key word in your query, or else you'll get

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

One simple way to work-around is to append a dummy column to your table where all row have the same value on that column. So in this case you are able to perform ranged query on just your desired column. Do realize that these kind of queries on a NoSQL database may be slow/bog down a system.


Example

cqlsh:demo> desc table table1;

CREATE TABLE table1 (
  keya int,
  keyb int,
  dummyvalue int,
  valuea int,
  PRIMARY KEY (keya, keyb)
) ....

cqlsh:demo> select * from Table1;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    1 |    2 |          0 |      3
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9

Create secondary indices on ValueA and DummyValue:

cqlsh:demo> create index table1_valuea on table1 (valuea);
cqlsh:demo> create index table1_valueb on table1 (dummyvalue);

Perform ranged query on ValueA with DummyValue=0:

cqlsh:demo> select * from table1 where dummyvalue = 0 and valuea > 5 allow filtering;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9
like image 91
keelar Avatar answered Oct 14 '22 16:10

keelar


Probably the most flexible way to deal with this scenario in Cassandra will be to have a separate CF for each stat, with sentinel values as keys and the stat value in the column name, like this:

CF: StatName {
  Key: SomeSentinelValue {
    [Value]:[UserID] = ""
  }
}

So let's say your stat is NumAnswers and your user IDs are strings:

CF: NumAnswers {
  Key: 0 {
    150:Joe = ""
    200:Bob = ""
    500:Sue = ""
  }
  Key: 1000 {
    1020:George = ""
    1300:Ringo = ""
    1300:Mary = ""
  }
}

So you can see that your keys are essentially buckets of values, which can be as coarse or fine grain as needed for your data, and your columns are composites of value + user ID. You can now hand Cassandra a known key (or set of keys) for the coarse range you need (the equality), then do a range query on the first component of the column name. Note that you cannot write the user ID as value, because this would prevent two users from having the same count.

like image 44
rs_atl Avatar answered Oct 14 '22 17:10

rs_atl