Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are dummy partition keys always bad?

Tags:

cassandra

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether. By dummy, I mean a column whose only purpose is to contain the same value for all rows, thereby putting all data on 1 node and giving the lowest possible cardinality. For example:

dummy  | id   | name
-------------------------
0      | 01   | 'Oliver'
0      | 02   | 'James'
0      | 03   | 'Nicholls'

The two main points in regards to why you should avoid dummy partition keys are:

1) You end up with data "hot-spots". There is a lot of data stored on 1 node so there's more traffic around that node and you have poor distribution around the cluster.

2) Partition space is finite. If you put all data on one partition, it will eventually be incapable of storing any more data.

I can understand these points and I agree that you definitely want to avoid those situations, so I put this idea out of my mind and tried to think of a good partition key for my table. The table in question stores sites and there are two common ways that table gets queried in our system. Either a single site is requested or all sites are requested.

This puts me in a bit of an awkward situation, because the table is either queried on nothing or the site ID, and making a unique field the partition key would give me very high cardinality and high latency on queries that request all sites.

So I decided that I'd just choose an arbitrary field that would give relatively low cardinality, even though it doesn't reflect how the data will actually be queried, just because it's better than having a cardinality that is either excessively high or excessively low. This approach also has problems though.

I could partition my data on column x, but we have numerous clients, all of whom use our system differently, so x for 1 client could give the results I'm after, but could give awful results for another.

At this point I'm running out of options. I need a field in my table that will be consistent for all clients, however this field doesn't exist, so I'm now considering having a new field that will contain a random number from 1-3 and then partitioning on that field, which is essentially just a dummy field. The only difference is that I want to randomise the values a little bit as to avoid hot-spots and unbounded row growth.

I know this is a data-modelling question and it varies from system to system, and of course there are going to be situations where you have to choose the lesser of two evils (there is no perfect solution), but what I'm really focussed on with this question is:

Are dummy partition keys something that should outright never be a consideration in Cassandra, or are there situations in which they're seen as acceptable? If you think the former, then how would you approach this situation?

like image 995
Oliver Nicholls Avatar asked Jan 21 '16 12:01

Oliver Nicholls


People also ask

Is partition key same as primary key?

A primary key uniquely identifies a row. A composite key is a key formed from multiple columns. A partition key is the primary lookup to find a set of rows, i.e. a partition. A clustering key is the part of the primary key that isn't the partition key (and defines the ordering within a partition).

What is database partition key?

Partition key: A simple primary key, composed of one attribute known as the partition key. Attributes in DynamoDB are similar in many ways to fields or columns in other database systems. Partition key and sort key: Referred to as a composite primary key, this type of key is composed of two attributes.


1 Answers

I can't find much on the subject of dummy partition keys in Cassandra, but what I can find tends to side with the idea that you should avoid them altogether.

I'm going to go out on a limb and guess that your search has yielded my article We Shall Have Order!, where I made my position on the use of "dummy" partition keys quite clear. Bearing that in mind, I'll try to provide some alternate solutions.

I see two potential problems to solve here. The first:

I need a field in my table that will be consistent for all clients, however this field doesn't exist

Typically this is solved by duplicating your data into another query table. That's the best way to serve multiple, varying query patterns. If you have one client (service?) that needs to query that table by site id, then you could have that table duplicated into a table called sites_by_id.

CREATE TABLE sites_by_id (
  id BIGINT,
  name TEXT,
  PRIMARY KEY (id));

The other problem is this query pattern:

all sites are requested

Another common Cassandra anti-pattern is that of unbound SELECTs (SELECT query without a WHERE clause). I am sure you understand why these are bad, as they require all nodes/partitions to be read for completion (which is probably why you are looking into a "dummy" key). But as the table supporting these types of queries increases in size, they will only get slower and slower over time...regardless of whether you execute an unbound SELECT or use a "dummy" key.

The solution here is to perform a re-examination of your data model, and business requirements. Perhaps your data can be split up into sites by region or country? Maybe your client really only needs the sites that have been updated for this year? Obtaining some more details on the client's query requirements may help you find a good partitioning key for them to use. Otherwise, if they really do need all of them all of the time, then doanduyhai's suggestion of using Spark will better fit your use case.

like image 65
Aaron Avatar answered Sep 28 '22 00:09

Aaron