How and why do joins reduce scalability in a large-scale distributed (relational) database system?
Thanks.
As a general consideration, there is significant overhead (e.g. non-user computation) in a distributed system that presents a 'coherent' and 'unified' facade.
Simply consider these factors:
Distinct nodes (e.g. servers) are distinct machines. This means the probability that all n nodes participating in a distributed action (e.g. a join) are in an optimal state (e.g. having just the right tables in cache, or having already acquired the appropriate locks) is low. So part of the overhead is the work each node must do to get into the appropriate state.
Naturally, the nodes need to communicate in order to coordinate. So there is network chatter between nodes, and those latencies are not insignificant.
The above overheads, in turn, increase the average time to service a request, and thus reduce availability (in terms of system capacity).
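To make the network-chatter point concrete, here is a toy model of how many rows have to cross the network before a join can even start. It is only a sketch under assumptions that are not in the answer above: rows are placed on nodes uniformly at random, and the join is done by hash repartitioning (each row is sent to the node chosen by its join key). The function name `fraction_of_rows_moved` and its parameters are made up for illustration.

```python
import random

def fraction_of_rows_moved(num_nodes, num_rows, seed=0):
    """Estimate the share of rows that must be sent to another node
    so that rows with equal join keys end up on the same node.

    Assumptions (hypothetical, for illustration only):
    - each row currently lives on a uniformly random node
    - the join repartitions rows by key modulo num_nodes
    """
    rng = random.Random(seed)
    moved = 0
    for _ in range(num_rows):
        key = rng.randrange(1_000_000)
        current_node = rng.randrange(num_nodes)  # where the row is stored now
        target_node = key % num_nodes            # where the join needs it
        if current_node != target_node:
            moved += 1
    return moved / num_rows

if __name__ == "__main__":
    for n in (1, 2, 10, 100):
        print(n, fraction_of_rows_moved(n, 10_000))
```

Under these assumptions the expected fraction is (n - 1) / n, so with a single node nothing moves, and as nodes are added nearly every row involved in the join has to travel over the network.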
Scalability becomes an issue because none of the above costs is O(1) in the number of nodes n. At the very best you can expect O(log n), and it could be as bad as O(n^2). That does wonders for killing scalability (which by definition means the ability of the system to scale to a larger number of nodes).
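As a rough illustration of the gap between those two bounds, the sketch below compares two hypothetical coordination patterns. Neither is claimed to be what any particular database does: `tree_rounds` assumes nodes coordinate through a binary tree (about log2 n rounds), and `all_pairs_messages` assumes every node sends one message to every other node (n * (n - 1) messages, which is O(n^2)).

```python
import math

def tree_rounds(num_nodes):
    """Rounds needed if coordination fans out through a binary tree
    (hypothetical best case, O(log n))."""
    return math.ceil(math.log2(num_nodes))

def all_pairs_messages(num_nodes):
    """Messages needed if every node talks to every other node
    (hypothetical worst case, O(n^2))."""
    return num_nodes * (num_nodes - 1)

if __name__ == "__main__":
    for n in (2, 8, 64, 1024):
        print(n, tree_rounds(n), all_pairs_messages(n))
```

Going from 8 to 1024 nodes, the tree-style cost grows from 3 rounds to 10, while the all-pairs cost grows from 56 messages to over a million. That difference is why the per-request overhead matters so much once the node count gets large.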
The above is part of the motivation for NoSQL systems: if one does not require coordination across nodes to service queries, then performance is substantially better. (As you can see, it is not magic; we are merely sacrificing systemic correctness for performance.)