
NoSQL denormalization data model


I have often read that data in NoSQL databases is stored denormalized. For instance, consider a chess game record. It may contain not only the IDs of the players taking part in the game, but also each player's first and last name. I suppose this is done because joins are not possible in NoSQL, so by duplicating the data you can still retrieve everything you need in one call, without combining the data manually at the application level.
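To make the example concrete, here is a minimal sketch of such a denormalized game record. The collections are plain Python dicts standing in for a document store, and all names (`players`, `games`, `game_summary`, the field names) are hypothetical, not taken from any particular database.

```python
# Hypothetical in-memory stand-ins for two document collections.
players = {
    "p1": {"first_name": "Anna", "last_name": "Berg"},
    "p2": {"first_name": "Carl", "last_name": "Dahl"},
}

# Denormalized game record: each player's name is copied next to the id.
games = {
    "g1": {
        "white": {"id": "p1", "first_name": "Anna", "last_name": "Berg"},
        "black": {"id": "p2", "first_name": "Carl", "last_name": "Dahl"},
        "moves": ["e4", "e5"],
    },
}


def game_summary(game_id):
    """Build a display line from one lookup, without touching `players`."""
    game = games[game_id]
    white, black = game["white"], game["black"]
    return (
        f"{white['first_name']} {white['last_name']} vs "
        f"{black['first_name']} {black['last_name']}"
    )
```

Calling `game_summary("g1")` needs only the game document, which is the single-call read the duplication is meant to allow.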

What I don't understand is that when you want to update a chess player's name, you now have to write a query that updates the player's own record as well as every chess game record in which that player participates. This seems like a huge performance overhead, as the database has to find all games that player took part in and then update each of those records.

Is it true that data is often stored denormalized like in my example?

asked Nov 30 '14 13:11 by Stefan




2 Answers

You are correct: data is often stored denormalized in NoSQL databases.

The update problem you describe is part of where the term "eventual consistency" comes from.

In your example, when you update the player's name (not a common event, but it can happen), you would issue a background job to update the name across all the other records. Yes, while that update is running you may still read an older value, but eventually the data will be consistent. Since we're not writing ATM software here, the performance/consistency tradeoff is acceptable.
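A rough sketch of that background-job idea, assuming an in-memory `deque` as a stand-in for a real job queue and plain dicts as stand-ins for the collections (all names here are hypothetical):

```python
from collections import deque

# Hypothetical stand-ins: the player record is the source of truth,
# and each game keeps its own copy of the player's name.
players = {"p1": {"name": "Anna Berg"}}
games = {
    "g1": {"white_id": "p1", "white_name": "Anna Berg"},
    "g2": {"white_id": "p1", "white_name": "Anna Berg"},
}
pending_jobs = deque()  # stand-in for a real background job queue


def rename_player(player_id, new_name):
    """Update the player record now; defer updating the copies."""
    players[player_id]["name"] = new_name
    pending_jobs.append((player_id, new_name))


def run_pending_jobs():
    """Background worker: propagate each rename to every game copy."""
    while pending_jobs:
        player_id, new_name = pending_jobs.popleft()
        for game in games.values():
            if game["white_id"] == player_id:
                game["white_name"] = new_name


rename_player("p1", "Anna Lind")
stale_name = games["g1"]["white_name"]  # read before the job has run
run_pending_jobs()
fresh_name = games["g1"]["white_name"]  # read after the job has run
```

Between `rename_player` and `run_pending_jobs`, a reader of a game record still sees the old name; once the job finishes, every copy matches the player record again.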

You can find more info here: http://www.allbuttonspressed.com/blog/django/2010/09/JOINs-via-denormalization-for-NoSQL-coders-Part-2-Materialized-views

answered Sep 27 '22 23:09 by George


One way to look at it is that a user changes their name extremely rarely, while board data is read and changed all the time.

So it makes sense to optimize for the case that happens far more often rather than for the one that almost never happens.

Another point to note is that by not keeping the name duplicated in the board data, you actually increase the cost of every read. Each time you fetch the board data, you have to take one more step and fetch the user data as well (even if all you really wanted was the first and last name).
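The extra read step can be illustrated with a small sketch that counts lookups. `CountingStore` is a hypothetical key-value store invented for this illustration, and the document shapes are assumptions:

```python
class CountingStore:
    """Hypothetical key-value store that counts the lookups it serves."""

    def __init__(self, docs):
        self.docs = docs
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.docs[key]


users = CountingStore({
    "u1": {"first": "Anna", "last": "Berg"},
    "u2": {"first": "Carl", "last": "Dahl"},
})
boards_normalized = CountingStore({"b1": {"player_ids": ["u1", "u2"]}})
boards_denormalized = CountingStore({"b1": {"players": [
    {"id": "u1", "first": "Anna", "last": "Berg"},
    {"id": "u2", "first": "Carl", "last": "Dahl"},
]}})


def names_normalized(board_id):
    """One read for the board, then one more read per player."""
    board = boards_normalized.get(board_id)
    found = [users.get(player_id) for player_id in board["player_ids"]]
    return [f"{u['first']} {u['last']}" for u in found]


def names_denormalized(board_id):
    """A single read: the names are already stored in the board document."""
    board = boards_denormalized.get(board_id)
    return [f"{p['first']} {p['last']}" for p in board["players"]]


normalized_names = names_normalized("b1")
normalized_reads = boards_normalized.reads + users.reads
denormalized_names = names_denormalized("b1")
denormalized_reads = boards_denormalized.reads
```

Both functions return the same names, but the normalized version pays one extra lookup per player on every read, while the denormalized version pays only when a name changes.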

Again, the reason to put the first and last name in the board data is probably that the screen showing the board will usually show the user's name too.

For these reasons, keeping duplicate data is acceptable in NoSQL databases. (It can be done in SQL databases too, but there it tends to be frowned upon.) Duplication is fairly common in the NoSQL world and is often encouraged.

answered Sep 27 '22 23:09 by Mohammed Sadiq