Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra CQL Select count with LIMIT

I created a simple tabe:

CREATE TABLE test (
  "type" varchar,
  "value" varchar,
  PRIMARY KEY(type,value)
);

I inserted 5 rows into it:

INSERT INTO test(type,value) VALUES('test','tag1')
INSERT INTO test(type,value) VALUES('test','tag2')
INSERT INTO test(type,value) VALUES('test','tag3')
INSERT INTO test(type,value) VALUES('test','tag4')
INSERT INTO test(type,value) VALUES('test','tag5')

I ran SELECT * from test LIMIT 3 and it works as expected.

 type | value
------+------
 test |  tag1
 test |  tag2
 test |  tag3

When I ran SELECT COUNT(*) from test LIMIT 3, it produces:

 count
-------
     5

Shouldn't it say 3?

The Datastax documentation seems to suggest that specifying a LIMIT will overwrite the default of 10,000. Why does it not work in this case? If it matters, I'm on Cassandra 2.2.5 and ran all the queries through cqlsh.

Update Both the Java driver and CQLSH have been tested to show that LIMIT indeed does not work as prescribed in the documentation. If there are any Datastax employees reading, your input would be greatly appreciated.

like image 470
light Avatar asked Mar 23 '16 09:03

light


3 Answers

This is a Bug in cassandra and version 2.2.x is affected by it.

https://issues.apache.org/jira/browse/CASSANDRA-8216

They have marked it as fixed, but clearly this has propagated to version beyond fixed version.

Anyways, light, Your assumption/thinking is completely correct. Limit keyword has to be applied on cassandra's count(*), and it works as it should in the versions I am working on 3.2.4 and 2.1.x

like image 64
Abhishek Anand Avatar answered Sep 30 '22 04:09

Abhishek Anand


My spontaneous response to this was that a row count always only returns one row in its result set, stating the number of rows found. So any LIMIT greater than 1 would not have an effect.

But as @light correctly pointed out, the documentation states that the LIMIT should apply to a count(*). And with good reason too. According to this blog post Cassandra cannot source any meta data to come up with the number or rows, but has to inspect every partition (on every node) to get to the number. It thus is a very expensive operation.

However, contrary to the documentation, when querying C* 2.2.4 with cqlsh or with the Java driver (v3.0.0) the LIMIT clause has no effect on the reported number of rows. Neither has the default limit of cqlsh of 10'000 rows. Nor has a LIMIT greater than 10'000 if there are more than 10'000.

The documentation and implementation seem to be out of sync. Though which one is incorrect I cannot say.

EDIT

The ticket referenced by @Abhishek Anand concludes that the documentation is wrong. Not the behavior. So specifying a limit of 1 will count all your rows. And that is the desired behavior.

like image 32
Ralf Avatar answered Sep 30 '22 06:09

Ralf


The limit clause is used to limitate the number of rows in the result. The count(*) return only a single row with the count of (in this case) total rows.

the "limit 3" does not affect the number of occurrences analyzed from the count(*), if you want this, you most be use a "where"

like image 21
mariettièllo Avatar answered Sep 30 '22 06:09

mariettièllo