
Cassandra Performance SELECT by id or SELECT by nothing

I am wondering whether Cassandra's (C*'s) SELECT speed depends on how we read an entire finite table.

For example we have this table

id | value
A  | x
A  | xx
B  | xx
C  | xxx
B  | xx

Would it be faster to get all the results if we ran
SELECT * FROM Y WHERE id='A'
SELECT * FROM Y WHERE id='B'
SELECT * FROM Y WHERE id='C'

or would it be faster if we ran
SELECT * FROM Y WHERE 1

or would it maybe be faster if we ran
SELECT * FROM Y WHERE id IN ('A', 'B', 'C')

Or would they all be equally fast (ignoring connection time)?

M. Hirn asked Oct 16 '25 17:10



1 Answer

Not sure what your column family (table) definition looks like, but your sample data could never exist like that in Cassandra. Primary keys are unique, so if id is your only primary key column, the last write for each id would win. Basically, your table would look something like this:

id | value
A  | xx
C  | xxx
B  | xx
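
The last-write-wins behavior can be mimicked with a few lines of Python. This is only a toy model, assuming id is the entire primary key of Y; the dict stands in for the table and says nothing about how Cassandra actually stores data.

```python
# Toy model of Cassandra's upsert semantics: one row per primary key,
# and a later write to the same key replaces the earlier value.
# Assumption: id is the entire primary key of table Y.

def apply_writes(writes):
    """Apply (id, value) writes in order and return the surviving rows."""
    table = {}
    for row_id, value in writes:
        table[row_id] = value  # a write to an existing key overwrites it
    return table

# The five writes from the question, in the order they were listed.
writes = [("A", "x"), ("A", "xx"), ("B", "xx"), ("C", "xxx"), ("B", "xx")]
rows = apply_writes(writes)
print(rows)
```

Only three rows survive. To keep several values per id, the primary key would need an additional clustering column, so that rows sharing an id are told apart by that column.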

As for your individual queries...

SELECT * FROM Y WHERE 1

That might work well with 3 rows, but it won't when you have 3 million, all spread across multiple nodes. (Note that CQL does not accept WHERE 1; an unrestricted query is simply SELECT * FROM Y.)

SELECT * FROM Y WHERE id IN ('A', 'B', 'C')

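This is definitely not any faster. With IN on a partition key, a single coordinator node has to fetch every listed partition, possibly from several different nodes, and assemble the result before replying. See my answer here as to why relying on IN for anything other than occasional OLAP convenience is not a good idea.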

SELECT * FROM Y WHERE id='A'
SELECT * FROM Y WHERE id='B'
SELECT * FROM Y WHERE id='C'

This is definitely the best way. Cassandra is designed to be queried by a specific partition key. Even if you want to read every row in the column family (table), each of these queries still gives Cassandra a specific partition key, which lets your driver quickly determine which node(s) to send the query to.
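
Why a partition key helps with routing can be illustrated roughly as follows. This is a sketch under loud assumptions: the three node names and the ring layout are hypothetical, MD5 stands in for Cassandra's default Murmur3 partitioner, and replication and virtual nodes are ignored.

```python
import hashlib
from bisect import bisect_left

# Hypothetical three-node ring; each node owns the token range that ends
# at its token. Assumptions: MD5 replaces the real Murmur3 partitioner,
# and replication and virtual nodes are left out.
RING_SIZE = 2 ** 32
NODES = [
    (RING_SIZE // 3, "node1"),
    (2 * RING_SIZE // 3, "node2"),
    (RING_SIZE - 1, "node3"),
]
TOKENS = [token for token, _ in NODES]

def token_for(partition_key):
    """Map a partition key to a position on the ring."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

def owner_of(partition_key):
    """Return the node whose token range contains the key's token."""
    index = bisect_left(TOKENS, token_for(partition_key))
    return NODES[index][1]

for key in ("A", "B", "C"):
    print(key, "->", owner_of(key))
```

A query that carries a partition key can be hashed like this and sent straight to an owner. A query without one has no token to compute, so it cannot be narrowed to a single node.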

Now, let's say you do have 3 million rows. For your application, is it faster to query each one individually, or to just do a SELECT *? The SELECT * might be faster from a query perspective, but you will still have to iterate through every row on the client side. That means managing them all within the constraints of your available JVM memory (which probably means paging them to some extent). But this is a bad (extreme) example, because there's no way you should ever want to send your client application 3 million rows to deal with.
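
The paging idea can be sketched like this. It is written in Python rather than Java purely for brevity, and fetch_page is a hypothetical stand-in for a driver call, not a real driver API; the in-memory ROWS list plays the role of the table.

```python
# Sketch of consuming a large result set page by page, so that only one
# page needs to be held in client memory at a time.
# Assumptions: fetch_page is a hypothetical stand-in for a driver call,
# and the in-memory ROWS list stands in for table Y.

ROWS = [("id%d" % i, "x" * (i % 3 + 1)) for i in range(10)]

def fetch_page(paging_state, fetch_size):
    """Return (rows, next_paging_state); the state is None on the last page."""
    start = paging_state or 0
    end = start + fetch_size
    next_state = end if end < len(ROWS) else None
    return ROWS[start:end], next_state

def iter_rows(fetch_size):
    """Yield every row while requesting one page at a time."""
    paging_state = None
    while True:
        page, paging_state = fetch_page(paging_state, fetch_size)
        yield from page
        if paging_state is None:
            break

count = sum(1 for _ in iter_rows(fetch_size=4))
print(count)
```

The caller still touches every row, which is the point made above: paging bounds memory use, but it does not reduce the amount of work the client has to do.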

The bottom line is that you'll have to negotiate these issues on your own and within the specifications of your application. But in terms of performance, I've noticed that appropriate query-based data modeling tends to outweigh query strategy or syntactic tricks.

Aaron answered Oct 19 '25 14:10



