
Why don't most NoSQL DBMSs have “pointers”?

What is the objective reason why most NoSQL storage solutions don't have some kind of "pointers" for ultra-efficient joins, like the pre-relational DBMSs had?

I mean, I partially understand the theoretical reasons why classical RDBMSs ditched pointers (the need to update them and keep them in sync between memory and disk, and no disks fast enough to be treated as random-access storage for some use cases, the way modern SSDs can be).

But of the many NoSQL solutions out there, why do so few of them realize that this model would be awesome (the exceptions I know of are OrientDB and Neo4j) for many practical cases, not only the ones that need graph traversals? I mean, when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one.

Isn't the use case of a NoSQL document DB overlapping enough with that of graph DBs that such a feature would make sense? Wouldn't it provide all the practical features of SQL joins to NoSQL solutions at little extra cost, make indexes unnecessary for most queries, and take up much less space for huge datasets?

(...and as a bonus any NoSQL solution would be ready to use as a graph db, and doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough)

NeuronQ asked Sep 29 '22 14:09


1 Answer

I believe the key problem is data locality and horizontal scalability. A premise of NoSQL is that the read-heavy models of RDBMSs, i.e. those that require joins, lead to bottlenecks.

Think of Twitter: the original data model was read-heavy, but the joins you need to make are insanely large (billions of tweets x hundreds of millions of users x tens of billions of follower-followee relations that are wildly varying in size [1-10M, or whatever aplusk has these days]).

When even the ids you want to join don't fit in a reasonable machine's RAM, calculating the overlap of ids becomes terribly expensive. If you take the actual data into account, horizontal scalability becomes next to impossible because there's no a priori knowledge of which shards / machines will need to be hit. Storing all follower pointers in every follower list would require insane bookkeeping for trivial changes, while not exploiting creation-time locality (or at least, creation-time locality per feed).
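To make the shard fan-out concrete, here is a minimal sketch. The 16-shard cluster and the hash-based placement are assumptions chosen for illustration, not the scheme of any particular data store. It shows that a single-tenant lookup touches one shard, while a follower list with arbitrary ids tends to touch most of the cluster.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical cluster size, chosen only for illustration


def shard_for(user_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map an id to a shard with a stable hash (assumed placement scheme)."""
    digest = hashlib.sha256(str(user_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shards_touched(ids, num_shards: int = NUM_SHARDS) -> set:
    """Return the set of shards a query over these ids has to contact."""
    return {shard_for(i, num_shards) for i in ids}


# A lookup keyed by one tenant id stays on a single shard.
print(len(shards_touched([42])), "shard(s) for a single-tenant lookup")

# A follower list with 1000 arbitrary ids is scattered across the cluster.
followers = range(1, 1001)
print(len(shards_touched(followers)), "of", NUM_SHARDS, "shards for the follower list")
```

With hash placement there is nothing that keeps a user's followers close together, so the join cannot be routed to a small, predictable set of machines.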

In a multi-tenant application, you can always shard by tenant, by sales region, by agent, or maybe even by time: you can find some locality criterion that works for more than 95% of the cases.

With graphs, that becomes a lot more complicated, especially for graphs with certain connection properties (scale-free networks with small diameter / small-world phenomenon): a simple post, say by a celebrity, can quickly spread through a large portion of the entire network, meaning that practically every query must hit the one node that holds the post.

Sure, the post itself would be cached by the web servers, but add likes and comments, or favorites and retweets, and the story becomes a nightmare (writes!). Add in notification emails, content ranking and filtering, and you're in true horror.

> doing a ~100 nodes path length traversal of a graph stored in Mongo would just automagically work fast enough

If that data happens to be on 100 different nodes, the sheer network overhead will be in the range of 50ms, even in a single datacenter with no congestion and idle machines. If this spreads across the world or individual queries take a little longer, you'll quickly end up at 5000ms. Also, the query would fail if even one machine is down.

This depends too much on the details of the network, which is why the problem should be solved by application code, not by the data store.
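The latency figures above can be reproduced with simple arithmetic. The sketch below assumes the hops are strictly sequential; the per-hop round-trip times (0.5 ms and 50 ms) are back-derived from the 50ms and 5000ms totals, and the 99.9% per-machine availability is a purely illustrative assumption.

```python
def sequential_traversal_ms(hops: int, round_trip_ms: float) -> float:
    """Total latency when every hop must wait for the previous one to return."""
    return hops * round_trip_ms


def success_probability(hops: int, node_availability: float) -> float:
    """Chance that all machines on the path are up, assuming independent failures."""
    return node_availability ** hops


# 100 hops inside one datacenter at an assumed 0.5 ms round trip each.
print(sequential_traversal_ms(100, 0.5))   # 50.0

# The same path with an assumed 50 ms round trip per hop (cross-region or slow queries).
print(sequential_traversal_ms(100, 50.0))  # 5000.0

# With an assumed 99.9% availability per machine, a 100-machine path
# succeeds only about 90% of the time.
print(round(success_probability(100, 0.999), 3))
```

The total scales linearly with the per-hop round trip, which is a property of the network rather than of the data store.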

> when you need things like multi-joins, you need to ping pong Mongo and do N queries instead of one

When you need multi-joins in MongoDB, you're using the wrong tool for your data model, or vice versa. Multi-join means normalized, which means read-heavy, which runs against the key concept of MongoDB. However, you can store quite large association lists even in MongoDB. But the tool becomes almost irrelevant here: if you look at Facebook TAO, for instance, there's little technology dependence in that.
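As a rough illustration of an association list resolved in application code, here is a sketch that uses an in-memory stand-in instead of a real driver. `FakeStore`, its methods, and the `following` field are hypothetical names invented for this example. The batched lookup plays the role that a single query with MongoDB's `$in` operator would play against a real collection.

```python
class FakeStore:
    """In-memory stand-in for a document collection that counts round trips."""

    def __init__(self, docs):
        self._docs = {d["_id"]: d for d in docs}
        self.round_trips = 0

    def find_one(self, doc_id):
        """Fetch a single document: one round trip."""
        self.round_trips += 1
        return self._docs.get(doc_id)

    def find_many(self, doc_ids):
        """Fetch a batch of documents by id: still one round trip."""
        self.round_trips += 1
        return [self._docs[i] for i in doc_ids if i in self._docs]


# Each user document embeds its association list ("following") as plain ids.
users = FakeStore([
    {"_id": 1, "name": "alice", "following": [2, 3]},
    {"_id": 2, "name": "bob", "following": []},
    {"_id": 3, "name": "carol", "following": [1]},
])


def followees_naive(store, user_id):
    """The 'ping pong' pattern: one query per followee (N + 1 round trips)."""
    user = store.find_one(user_id)
    return [store.find_one(fid)["name"] for fid in user["following"]]


def followees_batched(store, user_id):
    """Resolve the whole association list in one batched query (2 round trips)."""
    user = store.find_one(user_id)
    return [d["name"] for d in store.find_many(user["following"])]


print(followees_naive(users, 1), users.round_trips)
users.round_trips = 0
print(followees_batched(users, 1), users.round_trips)
```

Batching keeps the number of round trips constant per join level, but it does nothing about the fan-out across shards, which is the part that depends on how the data is laid out.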

mnemosyn answered Oct 03 '22 02:10