CQL SELECT greater-than query on indexed non-key column

Tags:

indexing

cassandra

EDIT1: added a case to describe the problem after the original question.

I wish to query on a column which is not part of my key. If I understand correctly, I need to define a secondary index on that column. However, I wish to use a greater-than condition (not just an equality condition), and that still seems unsupported.

Am I missing something? How would you address this issue?

My desired Setup:

Cassandra 1.1.6
CQL3

CREATE TABLE Table1(
             KeyA int,
             KeyB int,
             ValueA int,
             PRIMARY KEY (KeyA, KeyB)
           );

CREATE INDEX ON Table1 (ValueA);

SELECT * FROM Table1 WHERE ValueA > 3000;

Since defining a secondary index on ColumnFamilies with Composite Keys is still not supported in Cassandra 1.1.6, I have to settle on a temporary solution of dropping one of the keys, but I still have the same problem with non-equality conditions.

Is there another way to address this?

Thank you for your time.

Relevant sources: http://cassandra.apache.org/doc/cql3/CQL.html#selectStmt http://www.datastax.com/docs/1.1/ddl/indexes

EDIT1

Here's a case that will explain the problem. As rs-atl noted, it might be a data model problem. Let's say I keep a column family of all the users on Stack Overflow. For each user I keep a batch of stats (Reputation, NumOfAnswers, NumOfVotes... all of them are int). I want to query on those stats to get the relevant users.

CREATE TABLE UserStats(
             UserID int,
             Reputation int,
             NumOfAnswers int,
             .
             .
             .
             A lot of stats...
             .
             .
             .
             NumOfVotes int,
             PRIMARY KEY (UserID)
           );

Now I'm interested in slicing UserIDs based on those stats. I want all the users with over 10K reputation, I want all the users with fewer than 5 answers, etc.

I hope that helps. Thanks again.

asked Nov 27 '12 10:11

Oren

2 Answers

In CQL, you can use a column in the WHERE clause once you have created a secondary index on it. Otherwise, you will get the following error:

Bad Request: No indexed columns present in by-columns clause with Equal operator

Unfortunately, even with secondary indices, CQL requires the WHERE clause to contain at least one EQ condition on a secondary index, for performance reasons.

Q: Why is it necessary to always have at least one EQ comparison on secondary indices?

A: Inequalities on secondary indices are always done in memory, so without at least one EQ on another secondary index you will be loading every row in the database, which with a massive database isn't a good idea. So by requiring at least one EQ on a (secondary) index, you hopefully limit the set of rows that need to be read into memory to a manageable size. (Although obviously you can still get into trouble with that as well.)

So basically, if you have anything besides an EQ comparison, it loads all rows that otherwise match your query and checks them one at a time. That is not allowed by default since it "could be slow." (In essence, indexes only index "for equality", not for anything else such as < and >, which indexes on a relational database would.)

One thing to note is that if you have more than one non-EQ condition on secondary indices, you also need to include the ALLOW FILTERING keyword in your query, or else you'll get

Cannot execute this query as it might involve data filtering and thus may have unpredictable performance. If you want to execute this query despite the performance unpredictability, use ALLOW FILTERING

One simple workaround is to add a dummy column to your table in which every row has the same value. You can then perform a range query on just your desired column by pairing it with an EQ on the dummy column. Do realize that this kind of query on a NoSQL database may be slow or bog down the system.

Example

cqlsh:demo> desc table table1;

CREATE TABLE table1 (
  keya int,
  keyb int,
  dummyvalue int,
  valuea int,
  PRIMARY KEY (keya, keyb)
) ....

cqlsh:demo> select * from Table1;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    1 |    2 |          0 |      3
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9

Create secondary indices on ValueA and DummyValue:

cqlsh:demo> create index table1_valuea on table1 (valuea);
cqlsh:demo> create index table1_valueb on table1 (dummyvalue);

Perform ranged query on ValueA with DummyValue=0:

cqlsh:demo> select * from table1 where dummyvalue = 0 and valuea > 5 allow filtering;

 keya | keyb | dummyvalue | valuea
------+------+------------+--------
    4 |    5 |          0 |      6
    7 |    8 |          0 |      9
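
To make the cost of this workaround concrete, here is a minimal in-memory sketch in Python of the behaviour described above: one EQ is resolved through an equality-only index, and every other condition is checked row by row in memory. This is a toy model, not Cassandra client code; the `IndexedTable` class and its methods are hypothetical names made up for illustration, and only the table data comes from the example.

```python
from collections import defaultdict


class IndexedTable:
    """Toy in-memory stand-in for a table with equality-only secondary indexes."""

    def __init__(self, indexed_columns):
        self.rows = {}  # primary key -> row (dict of column -> value)
        # One hash index per column: value -> set of primary keys.
        self.indexes = {col: defaultdict(set) for col in indexed_columns}

    def insert(self, key, row):
        # Inserts only; overwriting a key would need the old index entries removed.
        self.rows[key] = row
        for col, index in self.indexes.items():
            index[row[col]].add(key)

    def select(self, eq_column, eq_value, predicate=lambda row: True):
        """Resolve one EQ through the index, then filter the loaded rows in memory.

        Returns (matching rows, number of rows loaded).
        """
        keys = self.indexes[eq_column].get(eq_value, set())
        loaded = [self.rows[k] for k in sorted(keys)]
        return [r for r in loaded if predicate(r)], len(loaded)


table = IndexedTable(indexed_columns=["dummyvalue", "valuea"])
for keya, keyb, valuea in [(1, 2, 3), (4, 5, 6), (7, 8, 9)]:
    table.insert((keya, keyb), {"keya": keya, "keyb": keyb, "dummyvalue": 0, "valuea": valuea})

# Counterpart of: WHERE dummyvalue = 0 AND valuea > 5 ALLOW FILTERING
matches, rows_loaded = table.select("dummyvalue", 0, lambda r: r["valuea"] > 5)
print([(r["keya"], r["keyb"], r["valuea"]) for r in matches], rows_loaded)
```

Because every row shares the same dummy value, the EQ narrows nothing: in this model all three rows are loaded in order to return two. An EQ on a selective column would load only the rows in that one index entry.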

answered Oct 14 '22 16:10

keelar

Probably the most flexible way to deal with this scenario in Cassandra will be to have a separate CF for each stat, with sentinel values as keys and the stat value in the column name, like this:

CF: StatName {
  Key: SomeSentinelValue {
    [Value]:[UserID] = ""
  }
}

So let's say your stat is NumAnswers and your user IDs are strings:

CF: NumAnswers {
  Key: 0 {
    150:Joe = ""
    200:Bob = ""
    500:Sue = ""
  }
  Key: 1000 {
    1020:George = ""
    1300:Ringo = ""
    1300:Mary = ""
  }
}

So you can see that your keys are essentially buckets of values, which can be as coarse or fine-grained as needed for your data, and your columns are composites of value + user ID. You can now hand Cassandra a known key (or set of keys) for the coarse range you need (the equality), then do a range query on the first component of the column name. Note that you cannot store the user ID as the column value instead of in the column name, because this would prevent two users from having the same count.
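
The bucket layout can be sketched as a small in-memory model in Python. This is only an illustration of the data structure, not Cassandra client code: the `StatIndex` class, its method names, and the bucket width of 1000 are assumptions chosen to match the keys 0 and 1000 in the example, and stat values are assumed to be integers. The model also assumes the set of bucket keys is known to the caller, as the answer requires.

```python
import bisect
from collections import defaultdict

BUCKET_SIZE = 1000  # assumed bucket width, matching the keys 0 and 1000 above


class StatIndex:
    """Toy model of one stat CF: bucket key -> sorted (value, user_id) column names."""

    def __init__(self, bucket_size=BUCKET_SIZE):
        self.bucket_size = bucket_size
        self.buckets = defaultdict(list)

    def _bucket_key(self, value):
        # Round down to the start of the bucket that contains this value.
        return (value // self.bucket_size) * self.bucket_size

    def add(self, value, user_id):
        # Columns stay sorted by (value, user_id), like composite column names.
        bisect.insort(self.buckets[self._bucket_key(value)], (value, user_id))

    def users_greater_than(self, threshold):
        """Return user IDs whose stat is strictly greater than threshold."""
        result = []
        first = self._bucket_key(threshold)
        # The "equality" part: visit only the known buckets that can hold matches.
        for key in sorted(k for k in self.buckets if k >= first):
            columns = self.buckets[key]
            # The "range" part: slice on the first component of the column name.
            # (threshold + 1,) sorts before every (threshold + 1, user_id) entry.
            start = bisect.bisect_left(columns, (threshold + 1,))
            result.extend(user_id for _, user_id in columns[start:])
        return result


num_answers = StatIndex()
for value, user in [(150, "Joe"), (200, "Bob"), (500, "Sue"),
                    (1020, "George"), (1300, "Ringo"), (1300, "Mary")]:
    num_answers.add(value, user)

print(num_answers.users_greater_than(400))
```

Ringo and Mary can both sit at 1300 because the user ID is part of the column name, so the two entries are distinct columns within the same bucket.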

answered Oct 14 '22 17:10

rs_atl
