I just read the DataStax post "Basic Rules of Cassandra Data Modeling" and, to sum up, we should model our database schema around our queries, not around our relations/objects. As a result, many tables can hold the same duplicated data, for example users_by_email and users_by_username, which both contain the same data.
How can I handle updates to such an object?
For example, when a user edits his email, do I UPDATE both tables manually, or do I only INSERT the object with all columns and not care about the previous data (which is still in my database, but with a stale email value)?
In the case of an UPDATE, how can I handle data synchronization?
Currently I'm doing it manually, but is there a tool to help me? Potentially I can have 5 or 6 tables with different partition/clustering keys.
I have heard that Hadoop or Apache Spark can do this.
To keep the data consistent across your many tables containing the same data, laid out differently, it's recommended that you use a LOGGED BATCH in CQL to do the update. A logged batch guarantees that the statements it contains are atomic: either all of them are (eventually) applied, or none are, so you don't have to worry about some failing and having to retry. (Note that batches are atomic but not isolated; another client may briefly observe a partially applied batch.)
Using the linked article's schema it would look like:
BEGIN BATCH
INSERT INTO users_by_email (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
INSERT INTO users_by_username (email, username, age) VALUES ('[email protected]', 'fromanator', 24);
APPLY BATCH;
The whole batch is atomic: if one insert fails, they all fail and no change is made.
In Cassandra, writes are upserts: given an existing record, an insert or update using the same primary key writes new cells with a newer timestamp, and the old values are superseded (and eventually discarded during compaction). There are a few subtleties in the difference between INSERT and UPDATE, such as counters and null values (writing an explicit null creates a tombstone), but those are probably not relevant to this question.
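To illustrate the upsert behaviour with the schema from the example above (the new address value here is just a placeholder):

INSERT INTO users_by_username (username, email, age) VALUES ('fromanator', '[email protected]', 24);
-- A later write with the same primary key (username) supersedes the old email;
-- the old cell is simply shadowed by the newer timestamp.
UPDATE users_by_username SET email = 'new-address@example.com' WHERE username = 'fromanator';

After the UPDATE, a SELECT by username returns the new email; no manual cleanup of the old value is needed within this one table.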
Up to Cassandra 3.0, the responsibility of keeping several views of the same data in sync lies with the client application. And yes, that means inserting/updating the new data in all the different tables that require it.
Cassandra 3.0 introduced "Materialized Views", which let you maintain a "master" table of the data and several views on it, all managed by Cassandra. This requires careful data modelling, so that the primary key of the master table contains the columns needed to build the different views and serve the related queries.
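As a sketch (the base table users and its exact columns here are assumptions, not taken from the linked article), the by-email view could be declared like this:

CREATE TABLE users (
    username text PRIMARY KEY,
    email text,
    age int
);

CREATE MATERIALIZED VIEW users_by_email AS
    SELECT * FROM users
    WHERE email IS NOT NULL AND username IS NOT NULL
    PRIMARY KEY (email, username);

Writes then go only to users, and Cassandra propagates them to users_by_email automatically. Note that every primary-key column of the base table must appear in the view's primary key, and each view key column needs an IS NOT NULL restriction.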
One additional note: if you find that your data is highly relational and requires several or many views to make it queryable, maybe Cassandra is not a good fit for the problem space and you should probably consider an RDBMS instead.
To extend the example: user information is probably something we would want to keep in a relational DB, while high-volume actions of those users (purchases, clicks, heart-rate samples, ...) could be recorded in Cassandra.