
Choosing a partition key for a Cassandra table -- how many is too many partitions?

I have an application where the 'natural' partition key for a Cassandra table seems like it would be 'customer'. This is the primary way we want to query the data, we would get good data distribution, etc.

But if there were well over 1 million customers, would that be too many different partitions?

Should I choose a partition key that results in a smaller number of partition keys?

I've looked at a number of the related questions on this topic but none seem to address this particular point.

Kevin Bedell asked Jun 04 '15 15:06



2 Answers

I think you misunderstand how the partition key is used. The default partitioner (Murmur3Partitioner) takes your partition key value and computes a 64-bit hash from it. That hash is called the token of the record, and it is the token that determines where the record is stored. Each Cassandra node has a set of token ranges associated with it. If the token of a record falls within a range owned by a node, the record is stored on that node. The number of token ranges is not determined by your choice of partition key: it is roughly equal to the total number of vnodes you selected when you configured your nodes (the num_tokens setting on each node). A million distinct customers therefore means a million partitions spread over that fixed set of token ranges, not a million ranges for the cluster to manage.
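To make the token-to-node lookup concrete, here is a minimal sketch in Python. It is an illustration only, not Cassandra's implementation: the names token_for and TokenRing are hypothetical, and MD5 truncated to 64 bits stands in for MurmurHash3 simply because it ships with Python's standard library.

```python
import bisect
import hashlib


def token_for(partition_key):
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 is only a stand-in for the MurmurHash3 function
    that Murmur3Partitioner uses; the token values will differ.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class TokenRing:
    """Hypothetical ring: each node owns the ranges ending at its tokens."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to the list of tokens (vnodes) it holds.
        self._ring = sorted(
            (token, node)
            for node, tokens in node_tokens.items()
            for token in tokens
        )
        self._tokens = [token for token, _ in self._ring]

    def owner_of_token(self, token):
        # A token belongs to the first ring entry whose token is >= it;
        # past the last entry, the ring wraps around to the first one.
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._ring[index][1]

    def owner(self, partition_key):
        return self.owner_of_token(token_for(partition_key))

    def range_count(self):
        # Fixed by the cluster layout, not by how many keys are stored.
        return len(self._tokens)
```

With a ring built from a handful of tokens per node, looking up "customer-1" or "customer-1000000" goes through the same small sorted list. Adding more customers adds partitions, but range_count() stays the same.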

Raedwald answered Sep 19 '22 15:09



But if there were well over 1 million customers, would that be too many different partitions?

No. The Murmur3Partitioner maps partition keys onto 2^64 possible token values (-2^63 to 2^63 - 1). Cassandra is designed to be very good at storing large amounts of data and retrieving it by partition key. There is a limit on the number of cells (rows times columns) within a single partition (2 billion), but for the total number of partitions I think you'll be fine with what you have.
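As a rough back-of-the-envelope check of the scale involved, the arithmetic can be written out in Python. The figure of one million customers comes from the question; the variable names are only for illustration.

```python
# Token range of Murmur3Partitioner: signed 64-bit integers.
TOKEN_MIN = -2**63
TOKEN_MAX = 2**63 - 1
TOKEN_SPACE = TOKEN_MAX - TOKEN_MIN + 1  # 2**64 possible token values

customers = 1_000_000  # "well over 1 million" in the question

# Share of the token space that one million distinct keys would touch,
# assuming each key hashes to a different token.
fraction_used = customers / TOKEN_SPACE

print(f"token space: {TOKEN_SPACE:,}")
print(f"fraction used by {customers:,} partitions: {fraction_used:.2e}")
```

The fraction comes out on the order of 5e-14, so even several orders of magnitude more customers would still occupy a tiny part of the available token space.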

Should I choose a partition key that results in a smaller number of partition keys?

Definitely not. That could cause your partitions to grow too big, and/or develop "hot spots" in your cluster.
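A small simulation can illustrate both effects. This is a sketch under stated assumptions: the workload (100,000 rows, 10,000 customers, 3 regions), the 6-node cluster, and the helper names are all hypothetical, and placing a key with "hash modulo node count" is a simplification of real token-range ownership.

```python
import hashlib
from collections import Counter


def node_for(partition_key, node_count):
    # Simplified placement: hash the key and take it modulo the node count.
    # Assumption: this stands in for real token-range ownership.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def rows_per_node(partition_keys, node_count):
    # Count how many rows land on each node, including nodes with none.
    counts = Counter({node: 0 for node in range(node_count)})
    for key in partition_keys:
        counts[node_for(key, node_count)] += 1
    return counts


def largest_partition(partition_keys):
    # Number of rows in the biggest single partition.
    return max(Counter(partition_keys).values())


# Hypothetical workload: each row has a customer and a coarser "region".
rows = [(f"customer-{i % 10_000}", f"region-{i % 3}") for i in range(100_000)]
by_customer = [customer for customer, _ in rows]
by_region = [region for _, region in rows]

NODE_COUNT = 6
customer_load = rows_per_node(by_customer, NODE_COUNT)
region_load = rows_per_node(by_region, NODE_COUNT)

print("largest partition by customer:", largest_partition(by_customer))
print("largest partition by region:  ", largest_partition(by_region))
print("rows per node by customer:", dict(customer_load))
print("rows per node by region:  ", dict(region_load))
```

With the coarse key there are only three partitions, so each one holds about a third of all rows and at least half of the six nodes receive no data at all. With the customer key, partitions stay small and every node gets a share.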

The main task in picking a good partition key is to find one that both offers good data distribution in the cluster and matches your query patterns. From what I'm reading, it sounds like you have done exactly that.

Aaron answered Sep 20 '22 15:09
