we have two entities User and Role. One User can have multiple Roles, and single Role can be shared by many users - typical m:n relation. Roles are also dynamic and we expect large amount (millions).
It is quiet simple to model such data in relational DB. I would like to find out whenever it would be possible in cassandra.
Currently I see two solutions:
A) Use normalized model and create something similar to inner-join
Create each single role in separate CF and store in User record foreign keys to referenced roles.
pro: Roles are not replicated and maintenance is simple
contra: In order to get all Roles for single User multiple network calls are necessary. User record contains only FK, Roles are stored using random partitioner, in this case each role could be stored on different cassandra node.
B) Denormalize model and replicate roles to avoid round trips In this scenario User record in cassandra contains all user roles as copy.
pro: It is possible to read User with all roles within single query. This guarantees short load times.
contra: Each shared Role is copied multiple times - on each related User. Maintaining roles is very difficult, especially if we have large data amount. For example: one Role is shared by 1000 users. Changes on this Role require update on 1000 User records. For very large data sets such updates has to be executed as asynchronous job.
Solutions above are very limited, meybie Cassandra is not right solution for m:n relations ? Do you know any cassandra design patter for such problem?
Thanks, Maciej
The way you want to design a data store in Cassandra is to start with the queries you plan to execute and make it so you can get all the information you need at once. Denormalization is the name of the game here; if you're not replicating that role information in each user node, you're not going to avoid disk seeks, and your read performance will suffer. Joins do not make sense; if you want a relational database, use a relational database.
At a guess, you're going to ask a lot of questions about what roles a user has and what they should be doing with them, so you definitely want to have role information duplicated in each user entry - probably with each role getting its own column (role-ROLE_KEY => serialized-capability-info instead of roles => [serialized array of capability info]). Your application will need some way to iterate over all those columns itself.
You will probably want to look at what users are in a role, and so you should probably store all the user information you'll need for that view in the role column family as well (though a subset of the full user record will do).
When you run updates, and add/remove users from roles, you will need to make sure that you update both the role's list of users and the user's roles at the same time. Because you're using a column for each relation, instead of a single shared serialized blob, this should work even if you're editing two different roles that share the same user at the same time: Cassandra can merge the updates, including the deletes.
If the query needs to be asynchronous, then go make your application handle it. Remember that Cassandra is an eventual-consistency data store and you shouldn't expect updates to be visible everywhere immediately anyway.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With