Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is the IN relation in Cassandra bad for queries?

Given an example of the following select in CQL:

SELECT * FROM tickets WHERE ID IN (1,2,3,4)

Given ID is a partition key, is using IN relation better than doing multiple queries or is there no difference?

like image 566
Andy Leung Avatar asked Nov 18 '14 16:11

Andy Leung


2 Answers

Yes, its better to query individually than using IN in Cassandra.

For this query, the coordinator has to get the data from 4 different partitions and if each partition is very big then the data gets filled in JVM which can cause problem.

Instead querying the data using multiple queries is better as each query is individual and don't have to wait for other partitions data to send it back to user.

like image 136
Nandakishore Avatar answered Nov 06 '22 07:11

Nandakishore


I remembered seeing someone answer this question in the Cassandra user mailing list a short while back, but I cannot find the exact message right now. Ironically, Cassandra Evangelist Rebecca Mills just posted an article that addresses this issue (Things you should be doing when using Cassandra drivers...points #13 and #22). But the answer is "yes" that in some cases, multiple, parallel queries would be faster than using an IN. The underlying reason can be found in the DataStax SELECT documentation.

When not to use IN

...Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.

So based on that, it would seem that this becomes more of a problem as your cluster gets larger.

Therefore, the best way to solve this problem (and not have to use IN at all) would be to rethink your data model for this query. Without knowing too much about your schema, perhaps there are attributes (column values) that are shared by ticket IDs 1, 2, 3, and 4. Maybe using something like level or group (if tickets are for a particular venue) or maybe even an event (id), instead.

Basically, while using a unique, high-cardinality identifier to partition your data sounds like a good idea, it actually makes it harder to query your data (in Cassandra) later on. If you could come up with a different column to partition your data on, that would certainly help you in this case. Regardless, creating a new, specific column family (table) to handle queries for those rows is going to be a better approach than using IN or multiple queries.

like image 24
Aaron Avatar answered Nov 06 '22 07:11

Aaron