I'm having trouble figuring out how to maintain attribute updates across multiple tables to ensure data consistency.
For example, suppose I have many-to-many relationship between actors and fans. A fan can support many actors, and an actor have many fans. I make several tables to support my queries
CREATE TABLE fans (
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int
PRIMARY KEY ((fan_id))
)
CREATE TABLE actors (
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int
PRIMARY KEY ((actor_id))
)
CREATE TABLE actors_by_fan (
fan_id uuid,
actor_id uuid,
actor_attr_1 int,
actor_attr_2 int
PRIMARY KEY (fan_id, actor_id)
)
CREATE TABLE fans_by_actor (
actor_id uuid,
fan_id uuid,
fan_attr_1 int,
fan_attr_2 int
PRIMARY KEY (actor_id, fan_id)
)
Let's say I'm a fan and I'm on my settings page and I want to change my fan_attr_1
to a different value.
On the fans
table I can update my attribute just fine since the application knows my fan_id and can key on that.
However I cannot change my fan_attr_1
on the fans_by_actor
without first querying for the actor_ids tied to the fan.
This problem occurs for any time you want to update any attribute of either fans
or actors
.
I've tried looking online for people experiencing similar problems, but I couldn't find them. For example, in Datastax's Data Modeling course they use the examples with actors and videos in a many to many relationship where they have tables actors_by_video
and videos_by_actor
. The course, like the other online resources I've consulted, discussed modeling tables after queries, but haven't dug into how to maintain data integrity. In the actors_by_video
table, what would happen if I want to change an actor's attribute? Wouldn't have have to go through every row of actors_by_video
to find the partitions that contain the actor and update the attribute? That sounds very inefficient. The other option is to look for the video id's beforehand, but I read elsewhere that reads before writes are an antipattern in Cassandra.
What would be the best approach for tackling this problem either from a data modeling standpoint or from a CQL standpoint?
EDIT: - Fixed sentence stubs - Added context and prior research
Data Modeling
Cassandra is not an Relational Database and there are certain basic rules need to be followed on DataModeling, at high-level the following goals need to be followed for our data model.
1) Spread data evenly around the cluster
2) Minimize the number of partitions read
Moreover we should go for a single big table rather than breaking it into multiple tables and adding relationship between the tables. In this approach duplication of records will occur. Duplication of records is not a costlier operation since it takes only a little more Disk Space rather than CPU, memory, disk IOPs, or network.
Please note that there is a size restriction on column key names and values. The maximum column key (and row key) size is 64KB. The maximum column value size is 2 GB. But becuase there is no streaming and the whole value is fetched in heap memory when requested, limit the size to only a few MBs.
More Info:
http://www.datastax.com/dev/blog/basic-rules-of-cassandra-data-modeling
http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/
https://docs.datastax.com/en/cql/3.1/cql/cql_reference/refLimits.html
CQL
Maintaining Consistency across tables can be done using Batch or Materialized Views. Materialized views is available from version 3.0
Please see
How to ensure data consistency in Cassandra on different tables?
My preference would be to change the data model and design it accordingly for our queries and if possible make it as a single big table.
Hope it Helps!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With