Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra table synchronization

I just read the DataStax post "Basic Rules of Cassandra Data Modeling" and, to sum up, we should modeling our database schema by our queries and not by our relations/objects. So, many tables can have the same duplicated data, for example users_by_email and users_by_username which both have the same data.

How can I handle the object update ?
For example the user edit his email, do I UPDATE both tables manually or only INSERT the object with all columns and don't care about previous data (which are still in my database, but with a wrong column value => email).

In case of UPDATE, how can I handle data synchronization ?
Currently, I'm doing it manually but is there a tool to help me ? Because, possibly, I can have 5 or 6 tables with different partition/clustering keys.
I heard that Hadoop can do it, or Apache Spark.

like image 236
Divi Avatar asked May 13 '15 16:05

Divi


2 Answers

To ensure data consistency across your many tables containing the same data, but laid out differently, it's recommended that you use a LOGGED BATCH in CQL to do the update. This way your CQL statements (updating data) in your BATCH are ACID, and you don't have to worry about some failing and retrying.

Using the linked article's schema it would look like:

BEGIN BATCH
  INSERT INTO users_by_email (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
  INSERT INTO users_by_username (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
APPLY BATCH;

This whole statement is Atomic, if one insert fails, they all fail and no change was made.

like image 152
fromanator Avatar answered Nov 16 '22 02:11

fromanator


In Cassadnra, given an existing record, an update or insert using the same primary key will result in the old record marked for deletion (with a tombstone) and the new record becomes "live". There're few subtleties in the difference between Insert and Update, like counters and null values, but those are probably not relevant for the question.

Up to Cassandra 3.0, the responsibility of maintaining several views of the same data in sync is in hands of the client application. And yes, it means to insert/update the new data in all the different tables that require it.

Cassandra 3.0 introduced "Materialized Views", which lets you maintain a "master" table of the data and several views on it, all managed by Cassandra. It requires careful data modelling so that the primary key of the 'master' table contains the required entities to create the different views and related queries needed.

One additional note: If you find that your data is highly relational and requires several/many views to make it query-able, maybe Cassandra is not a good fit for the problem space and probably you should consider a RDBMS instead.

To extend on the example provided, probably user information is something we would like to keep in a relational DB, while high volume actions of those users could be registered in Cassandra. (purchases, clicks, heart rate samples, ...)

like image 22
maasg Avatar answered Nov 16 '22 02:11

maasg