
Cassandra Data sync issues

Tags:

cassandra

I've been researching Cassandra for over two weeks to get a full grasp of it. I've read almost everything on the web about Cassandra and am still not clear on some concepts. They are as follows:

As per the documentation, we model our column families around our queries, so we need to know our queries beforehand, which is not at all possible in a real-world scenario. We can have a certain set of queries beforehand, but they keep changing over time. So if I design a model based on my earlier queries, then when a new requirement comes in, I need to redesign the model. And as I read in one SO thread, it's very hard to fix a bad Cassandra data model in the future. For example, I have a user model with the following fields:

name, age, phone, imei, address, state, city, registration_type, created_at

Currently, I need to filter only by state (let's say), so I'll make state the PK. Let's name the model UserByState. Now, after 2-3 months, a requirement to filter by created_at comes in, so I'll create a model UserByCreatedAt with created_at as the PK.

Now there are 2 problems:-

a) If I create a new model when the requirement comes in, I need the existing data in the new model as well. Hence I need to migrate the data from UserByState to UserByCreatedAt, i.e. write a script to copy it over. Correct me if I'm wrong!

If another new filtering requirement comes in, I'll be creating new models and then migration and so on.

b) If I create the models beforehand as per the queries, I need to keep the data in sync. In the above case of users, I created 2 models for 2 queries:

UserByState and UserByCreatedAt

So do I need to issue 2 different write queries, i.e.

UserByState.create(row=value, ...)
UserByCreatedAt.create(row=value, ...)

And if I have other models, such as 'UserByGender' and so on, do I need to apply separate write queries to each model MANUALLY, or does it happen on its own? This is where the problem of keeping the data in sync arises.

PythonEnthusiast asked Mar 03 '26 23:03



1 Answer

There is no free lunch in distributed systems, and you've hit some of the key limitations on the head.

If you want extremely performant writes that scale horizontally, you end up having to make concessions in other parts of the database. Cassandra chose to sacrifice flexibility in query patterns to ensure extremely fast access for well-defined query patterns.

When most users reach a situation where they need two very different and frequent query patterns, they build a second table and update both at once. To get atomicity with the multi-table writes, logged batching can be used to make sure that either all of the data is written or none of it is. Logged batching increases the cost of each write, so this is yet another tradeoff with performance. Beyond that, the normal consistency level tradeoffs all still apply.
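As a rough illustration of the logged batch idea, here is a minimal Python sketch that only builds the CQL text for writing one user row to both tables. The table names (user_by_state, user_by_created_at) and the column subset are assumptions made up for this example; only the BEGIN BATCH ... APPLY BATCH form is standard CQL.

```python
# Hypothetical table and column names, chosen to mirror the models in the question.
TABLES = ["user_by_state", "user_by_created_at"]
COLUMNS = ["name", "state", "created_at"]


def build_logged_batch(tables, columns):
    """Return one CQL logged batch that inserts the same row into every table."""
    column_list = ", ".join(columns)
    # Named bind markers (:name, :state, ...) so each insert takes the same values.
    markers = ", ".join(f":{column}" for column in columns)
    inserts = [
        f"  INSERT INTO {table} ({column_list}) VALUES ({markers});"
        for table in tables
    ]
    # A plain BEGIN BATCH is a logged batch; BEGIN UNLOGGED BATCH skips the batch log.
    return "\n".join(["BEGIN BATCH", *inserts, "APPLY BATCH;"])


batch_cql = build_logged_batch(TABLES, COLUMNS)
print(batch_cql)
```

The resulting statement would then be executed through whichever driver or object mapper you use. Adding a third table such as a hypothetical user_by_gender only means adding one more entry to TABLES, but every extra table makes each write more expensive.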

For moving data from the old table to the new one, Hadoop and Spark are good options. These are batch-based systems, so they will not provide low latency, but they are great for one-offs like rebuilding a table with a new index, and for cron job operations.
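To make the shape of such a backfill concrete, here is a small in-memory Python sketch of the re-keying step only. It is not Spark or Hadoop code: the rows are made-up sample data standing in for what a job would read from the old table, and a real job would do the read and the write in parallel across the cluster.

```python
from collections import defaultdict


def rekey(rows, partition_key):
    """Group rows under a new partition key, as a backfill would before writing."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)
    return dict(partitions)


# Hypothetical rows, as they might come out of the old state-keyed table.
old_rows = [
    {"name": "a", "state": "KA", "created_at": "2015-01-01"},
    {"name": "b", "state": "KA", "created_at": "2015-01-02"},
    {"name": "c", "state": "MH", "created_at": "2015-01-01"},
]

# The same rows, regrouped by the key of the new table.
by_created_at = rekey(old_rows, "created_at")
for created_at, rows in sorted(by_created_at.items()):
    print(created_at, [row["name"] for row in rows])
```

Each group corresponds to one partition of the new table, so the write side of the job becomes one insert per row into the table keyed by created_at.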

RussS answered Mar 05 '26 19:03
