Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra IN clause on index

Assume a simple table with one insert (or without this insert, doesn't really matter).

CREATE TABLE test (
 x int,
 y int,
 z int,
 PRIMARY KEY (x, y, z)
);


create index z_index on test (z);

insert into test(x, y, z) values (1,2,3);

I struggle to understand why I cannot query with an in clause on the index z:

cqlsh:test> select * from test where z in (3);
Bad Request: PRIMARY KEY part z cannot be restricted (preceding part y is either not restricted or by a non-EQ relation)

It is possible with a simple equals predicate:

cqlsh:test> select * from test where z = 3;

 x | y | z
---+---+---
 1 | 2 | 3

(0 rows)

I thought having an index on z would keep a mapping from specific values of z to rows but this assumption seems wrong.

Why this doesn't work the way I expected? I guess the index works differently.

EDIT: I am using [cqlsh 4.1.1 | Cassandra 2.0.6 | CQL spec 3.1.1 | Thrift protocol 19.39.0]

like image 372
CatInABox Avatar asked May 26 '14 16:05

CatInABox


People also ask

Can we use in clause in Cassandra?

Using the SELECT command with the IN keyword. The IN keyword can define a set of clustering columns to fetch together, supporting a "multi-get" of CQL rows. A single clustering column can be defined if all preceding columns are defined for either equality or group inclusion.

Can we index in Cassandra?

An index provides a means to access data in Cassandra using attributes other than the partition key for fast, efficient lookup of data matching a given condition. An index provides a means to access data in Cassandra using attributes other than the partition key.

Does Cassandra support multiple indexes?

Indexes can be created on multiple columns and used in queries. The general rule about cardinality applies to all columns indexed.

How do I enable Sasi index in Cassandra?

To create SASI indexes use CQLs CREATE CUSTOM INDEX statement: cqlsh:demo> CREATE CUSTOM INDEX ON sasi (first_name) USING 'org. apache. cassandra.


1 Answers

While the documentation on DataStax is normally really good, I couldn't find anything discussing the details behind this. I did however, run across this article called A deep look at the CQL WHERE clause. The second section titled "WHERE clause restrictions for SELECT statements" has several paragraphs discussing the IN restriction.

To paraphrase, it basically says this:

For single column keys, the IN operator is allowed without restriction. For composite primary keys, I have to use the = operator on the first N-1 columns of the partition key in order to use the IN operator on the last column.

In your case, x is your partition key, which means that x is the only column that will support the IN operator of CQL. If you really do need to be able to support IN queries on column z, then you would have to de-normalize your data, and build a (redundant) table designed to support that query. For instance:

CREATE TABLE test (
 x int,
 y int,
 z int,
 PRIMARY KEY (z)
);

...would support the query, but the values of z may not be unique. In that case, you could define x and/or y as a LIST<int> and that would do it.

Also, DataStax does have documentation available on when not to use an index, and they state that the same conditions apply to the use of the IN operator.

Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.

like image 66
Aaron Avatar answered Sep 24 '22 13:09

Aaron