
High and low cardinality in Cassandra

I keep coming across these terms: high cardinality and low cardinality in Cassandra.

I don't understand what exactly they mean, what effect they have on queries, and which is preferred. Please explain with an example, since that will be easy to follow.

eagertoLearn asked Aug 03 '14 02:08



1 Answer

The cardinality of X is nothing more than the number of elements that compose X. In Cassandra the partition key cardinality is very important for partitioning data.

Since the partition key determines how data is distributed across the cluster, choosing a low-cardinality key can leave your data concentrated on a few nodes instead of spread evenly across all of them.

Imagine you have a cluster of 20 nodes storing comments, with a replication factor (RF) of 2. Each comment has its own vote, ranging from 1 to 5. Now, since you want to easily retrieve comments by vote, you might be tempted to choose vote as the partition key.

CREATE TABLE comments(vote int, content text, id uuid, PRIMARY KEY(vote, id));

In this situation the only key responsible for data distribution is vote, which has very low cardinality since it can contain only 5 values (1, 2, 3, 4, 5). This means that, in the best case, 5 different nodes will own the 5 partitions ("all comments with vote 1" ... "all comments with vote 5") and, again in the best case, with an RF of 2, 10 different nodes will hold your data. As you can see, you have a 20-node cluster of which no more than 50% is used, even in the best case.
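The arithmetic behind that 50% figure can be written down as a small sketch. The function name `max_nodes_used` is made up for illustration and is not part of any Cassandra API; it only computes the upper bound described above, assuming each partition is stored on RF distinct nodes.

```python
def max_nodes_used(distinct_keys, replication_factor, cluster_size):
    """Upper bound on the number of nodes that can hold data.

    Each distinct partition key value yields one partition, and each
    partition is stored on `replication_factor` nodes, so at most
    distinct_keys * replication_factor nodes are involved (capped by
    the size of the cluster).
    """
    return min(distinct_keys * replication_factor, cluster_size)


# The example from the answer: 5 vote values, RF 2, 20 nodes.
best_case = max_nodes_used(distinct_keys=5, replication_factor=2, cluster_size=20)
utilization = best_case / 20
print(best_case, utilization)
```

This is only the best case: if two partitions happen to land on the same nodes, even fewer nodes hold data.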

Data distribution is very important, which is why partition key cardinality matters so much.
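To see the difference between a low- and a high-cardinality key, here is a toy simulation. It is a rough sketch under stated assumptions, not how Cassandra actually places data: the real partitioner maps keys to tokens on a ring, whereas this model simply hashes the key with MD5, takes the result modulo the cluster size to pick a primary node, and puts replicas on the following nodes. The helper names `replica_nodes` and `nodes_holding_data` are hypothetical.

```python
import hashlib
import uuid


def replica_nodes(partition_key, cluster_size, rf):
    """Toy placement (an assumption, not Cassandra's real algorithm):
    hash the key to pick a primary node, then use the next rf - 1 nodes."""
    digest = hashlib.md5(partition_key.encode()).digest()
    primary = int.from_bytes(digest[:8], "big") % cluster_size
    return [(primary + i) % cluster_size for i in range(rf)]


def nodes_holding_data(keys, cluster_size=20, rf=2):
    """Return the set of nodes that store at least one partition."""
    used = set()
    for key in keys:
        used.update(replica_nodes(str(key), cluster_size, rf))
    return used


# Low cardinality: vote as partition key, only 5 possible values.
low = nodes_holding_data(range(1, 6))

# High cardinality: a UUID as partition key, 10,000 distinct values.
high = nodes_holding_data(uuid.uuid4() for _ in range(10_000))

print(len(low), "nodes used with vote as partition key")
print(len(high), "nodes used with a uuid as partition key")
```

With vote as the key, at most 10 of the 20 nodes can ever hold data, no matter how many comments are stored. With a high-cardinality key the partitions spread over the whole cluster.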

HTH, Carlo

Carlo Bertuccini answered Nov 04 '22 16:11