Cassandra

Question

I'm working on a Cassandra data model to store records uploaded by users.

The potential problem is, some users may upload 50-100k rows in a 5 minute period, which can result in a "hot spot" for the partiton key (user_id). (Datastax recommendation is to rethink data model if more than 10k rows per partition).

How can I avoid having too many records on a partition key in a short amount of time?

I've tried using the Time Series suggestions from Datastax, but even if I had year, month, day, hour columns, a hot spot may still occur.

CREATE TABLE uploads (
    user_id text
   ,rec_id timeuuid
   ,rec_key text
   ,rec_value text
   ,PRIMARY KEY (user_id, rec_id)
);

The use cases are:

Get all upload records by user_id
Search for upload records by date range range

Jim Meyer · Accepted Answer

A few possible ideas:

Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is when you do reads, you have to repeat them n times to read all the partitions.
Have a separate table to handle incoming uploads using the rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then to get that data into the table with user_id as the partition key, periodically run a spark job to extract new uploads and add them to the user_id based table at a rate the the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.

Cassandra - Data Modeling Time Series - Avoiding "Hot Spots"?

Tags:

data-modeling

nosql

datastax

user2966445

1 Answers

Jim Meyer

Recent Activity

Donate For Us