I'm working on a Cassandra data model to store records uploaded by users.
The potential problem is, some users may upload 50-100k rows in a 5 minute period, which can result in a "hot spot" for the partiton key (user_id). (Datastax recommendation is to rethink data model if more than 10k rows per partition).
How can I avoid having too many records on a partition key in a short amount of time?
I've tried using the Time Series suggestions from Datastax, but even if I had year, month, day, hour columns, a hot spot may still occur.
CREATE TABLE uploads (
user_id text
,rec_id timeuuid
,rec_key text
,rec_value text
,PRIMARY KEY (user_id, rec_id)
);
The use cases are:
A few possible ideas:
Use a compound partition key instead of just user_id. The second part of the partition key could be a random number from 1 to n. For example if n were 5, then your uploads would be spread out over five partitions per user instead of just one. The downside is when you do reads, you have to repeat them n times to read all the partitions.
Have a separate table to handle incoming uploads using the rec_id as the partition key. This would spread the load of uploads equally across all the available nodes. Then to get that data into the table with user_id as the partition key, periodically run a spark job to extract new uploads and add them to the user_id based table at a rate the the single partitions can handle.
Modify your front end to throttle the rate at which an individual user can upload records. If only a few users are uploading at a high enough rate to cause a problem, it may be easier to limit them rather than modify your whole architecture.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With