Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cassandra denormalization datamodel

I read that in nosql (cassandra for instance) data is often stored denormalized. For instance see this SO answer or this website.

An example is if you have a column family of employees and departments and you want to execute a query: select * from Emps where Birthdate = '25/04/1975' Then you have to make a column family birthday_Emps and store the ID of each employee as a column. So then you can query the birthday_Emps family for the key '25/04/1975' and instantly get all the ID's of the employees born on that date. You can even denormalize the employee details into birthday_Emps as well so that you also instantly have the employee names.

Is this really the way to do it?

  1. Whenever an employee is deleted or inserted then you will have to remove the employee from birthday_Emps too. And in another example someone even said that sometimes you have a situation where one delete in some table requires like 100's of deletes in other tables. Is this really common to do?

  2. Is it common to do joins in application code? Do you have software that allows you create pre-written applications to join together data from different queries?

  3. Are there best practices, patterns, etc for handling these data model questions?

like image 329
Stefan Avatar asked Dec 03 '14 20:12

Stefan


1 Answers

"Yes" for the most part, taking an approach of query-based data modeling really is the best way to do it.

  1. That is still a good idea to do, because the speed of your query times make it worth it. Yes, there's a little more housecleaning to do. I haven't had to execute 100s of deletes from other column families, but occasionally there is some complicated clean-up to do. But, you shouldn't be doing a whole lot of deleting in Cassandra anyway (anti-pattern).

  2. No. Client-side JOINs are just as bad as distributed JOINs. The whole idea is to create a table to return data for each specific query...denormalized and/or replicated...and thus negating the need to do a JOIN at all. The exception to this, is if you are running OLAP queries for analysis, you can use a tool like Apache Spark to execute an ad-hoc, distributed JOIN. But it's definitely not something you'd want to do on a production system.

  3. A few articles I can recommend:

    • Getting Started with Cassandra Time Series Data Modeling - Written by DataStax's Chief Evangelist Patrick McFadin, it covers one of the more common Cassandra use cases in a few different ways.
    • Escaping From Disco-Era Data Modeling - This one talks about some of the obstacles that beginners with Cassandra can face, as well as the general approach to take in overcoming them. Disclaimer: I am the author.
    • Cassandra Data Modeling Best Practices, Part 1 - You can't go wrong with Jay Patel's (eBay) classic article on Cassandra modeling practices. It's a little dated in that the examples are grounded in the pre-CQL world, but the techniques still resonate.
like image 161
Aaron Avatar answered Oct 08 '22 05:10

Aaron