I know that secondary indices in Cassandra are generally a bad idea because the index is stored locally in each node i.e. not distributed across the cluster which may result in a query scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true? and if it's then why?
In general secondary indexes are bad idea, not only for the distributed part, but also for the index size and the number of distinct value, so if you have a field with high or low cardinality,you will be spending time on scanning many rows or many columns. Also you can have other issue while dealing with tombstones ...
To answer your question, secondary index in Cassandra doesn't scale that good, but if you use a partition key and by it you tell Cassandra which node have the data, it perform really better ! you can find more details here in section F :
https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
I hope this helps !
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With