
Overhead and (in)efficiency of NoSQL databases?

I have a question about NoSQL type databases, in particular MongoDB, but it applies in general to most key-value or document based storages. Some of the selling points of NoSQL are speed and scalability, but it seems to me that there is significant overhead compared to relational databases.

  1. You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases. I'm more concerned about the next ones:

  2. There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.

  3. The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)

I know you can write your own views or map/reduce algorithms to get something like an index, but at first glance it seems that, in the general case, NoSQL must be terribly inefficient space- and CPU-wise.

Is it really that bad? What kinds of optimizations are in place in NoSQL databases (say MongoDB)? What's the overhead in storing lots of identical complex JSON documents compared to using a relational database?

asked Aug 30 '12 by jdm

1 Answer

First, any overhead or inefficiency more often than not simply represents a choice of priorities; overhead somewhere gives you an advantage somewhere else.

As for your specific points, I think the answers will depend a lot on the exact NoSQL product, even among the key-value or document-based subgroup, but here are some thoughts:

1- You have lots of duplication because (almost) everything is unnormalized. You can't do much about it because this is kind of the point of such databases.

Actually, most (if not all) key-value databases can be used with any schema you want. So you can have a "normalized schema" laid upon a key-value store, resulting in no duplication. Don't forget that there are SQL solutions available for some (or most?) key-value databases.
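For illustration, here is a minimal sketch of a "normalized" layout on top of a generic key-value store. A plain Python dict stands in for the store (real stores like Redis or LevelDB expose similar get/put semantics), and the 'user:<id>:<field>' key scheme is just one made-up convention:

```python
# Minimal sketch: a normalized layout over a generic key-value store.
# A plain dict stands in for the store; the key scheme is illustrative.
store = {}

def put_user(user_id, name, age):
    # One key per attribute, like a normalized row split into columns,
    # so the field values are not duplicated inside every document.
    store[f"user:{user_id}:name"] = name
    store[f"user:{user_id}:age"] = age

def get_user(user_id):
    return {
        "name": store.get(f"user:{user_id}:name"),
        "age": store.get(f"user:{user_id}:age"),
    }

put_user(1, "Alice", 30)
print(get_user(1))  # {'name': 'Alice', 'age': 30}
```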

2- There is a lot of overhead because, if you have a JSON document, you have to save all the keys (and all the structural information) with each document. So for 10000 rows, you'll have to save the strings 'age', 'name', ... 10000 times.

I guess this depends on how the database engine is implemented, but compression, whether sophisticated or a simple "tokenization" of the repeated keys, can be used, resulting in no significant overhead there either.
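As a toy illustration of that "tokenization" idea (not how any particular engine does it), the repeated field names can be stored once in a shared dictionary and replaced by short integer tokens, so 10000 documents do not each carry the strings 'age' and 'name':

```python
# Toy key tokenization: field names are interned once and replaced by
# integer tokens in each stored document. Purely illustrative.
key_to_token = {}
token_to_key = []

def tokenize(doc):
    out = {}
    for key, value in doc.items():
        if key not in key_to_token:
            key_to_token[key] = len(token_to_key)
            token_to_key.append(key)
        out[key_to_token[key]] = value
    return out

def detokenize(doc):
    return {token_to_key[token]: value for token, value in doc.items()}

compact = tokenize({"name": "Alice", "age": 30})
print(compact)              # {0: 'Alice', 1: 30}
print(detokenize(compact))  # {'name': 'Alice', 'age': 30}
```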

3- The database can't do a lot of clever stuff like creating indices or binary trees (to save time) or storing integers in a compact way (because one of the free-form documents could have a string where all the others have an int, etc.)

Again, nothing prevents a key-value or document-based database from using any kind of tree under the hood, or from storing integers in a compact way (for example, it can keep a simple binary flag indicating whether the data is stored as a string or as a "compact integer"). As for creating indices, that is also possible (for the same reasons stated in 1, or it can be done manually by the application).
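Here is a rough sketch of that binary-flag idea (illustrative only; real formats such as BSON use a per-element type byte along broadly similar lines): one leading byte says whether the payload is a compact 4-byte integer or a UTF-8 string, so a mixed-type field costs only that flag.

```python
import struct

def encode(value):
    # Flag 0x01: little-endian int32 payload.
    if isinstance(value, int):
        return b"\x01" + struct.pack("<i", value)
    # Flag 0x02: 4-byte length prefix, then UTF-8 bytes.
    data = str(value).encode("utf-8")
    return b"\x02" + struct.pack("<i", len(data)) + data

def decode(blob):
    flag = blob[0]
    if flag == 0x01:
        return struct.unpack("<i", blob[1:5])[0]
    length = struct.unpack("<i", blob[1:5])[0]
    return blob[5:5 + length].decode("utf-8")

print(decode(encode(42)))       # 42
print(decode(encode("forty")))  # forty
```

And for indices specifically, MongoDB exposes them directly rather than requiring hand-rolled map/reduce: in pymongo, for instance, collection.create_index("age") builds a B-tree index on that field.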

answered Nov 19 '22 by Laurent Parenteau