Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does sharding handle the joining of related tables?

When I read about sharding, looks like authors don't take into account other tables the sharded table has to be joined to (even though they describe a shard as a "subset of an original database"). However, this is a very common situation and I still don't have a clue how to handle that. Some of the authors mention "static" tables referenced by a sharded table that may be replicated to each shard (for example, Country). However, they say nothing about tables referencing the sharded one.

Imagine that we run a social network and realize that our User table (id, name) cannot fit to a single server anymore because of an enormous amount of writes or because of size (or both). So we decide to partition it horizontally into multiple shards (say, 4, so users with id 1-1000 go to one shard, 1001-2000 to another etc.) and choose a User.id as a shard key. Since the User table is routinely joined to other tables, we move records from tables referencing a given user or referenced by it to a corresponding shard (this is quite a challenge because relations are often transitive, for example, table A may reference B which references the sharded table C). In order to simplify things, we can decide to replicate all but the User table to all shards in their entirety. So far so good.

Then, imagine the Friends table (id, user_id, friend_id) containing information regarding who is a friend of who and referencing the User table. A user 1001 has 2 friends, 2002 and 3003, and they are located on different shards. So if we need to fetch information about the user 1001 friends, we will have to perform 2 cross-shard joins. Even if we managed to places all related users on the same shard initially, a user can add a new friend from a different shard. We cannot move this friend 4004 to the user 1001 because other users from the same shard #5 can also have him as a friend.

To be honest, I cannot figure out how situations like this are handled when sharding is performed and I haven't seen any resources explaining that.

like image 726
raiks Avatar asked Nov 24 '17 11:11

raiks


People also ask

Can you join across shards?

If there are other tables that are related, you can add those too, think of it as a "shard tree" of related tables. Then all the data for a given Player will be guaranteed to be in the same shard. You can then perform joins within a shard, but you cannot do inner joins across shards.

How does database sharding work?

What is database sharding? Sharding is a method for distributing a single dataset across multiple databases, which can then be stored on multiple machines. This allows for larger datasets to be split into smaller chunks and stored in multiple data nodes, increasing the total storage capacity of the system.

Can sharding be done on relational database?

Sharding, also known as horizontal partitioning, is a popular scale-out approach for relational databases. Amazon Relational Database Service (Amazon RDS) is a managed relational database service that provides great features to make sharding easy to use in the cloud.

How does sharding work in SQL server?

A shard is a data store in its own right (it can contain the data for many entities of different types), running on a server acting as a storage node. This pattern has the following benefits: You can scale the system out by adding further shards running on additional storage nodes.


1 Answers

When doing a join across sharded tables what you generally want to optimize for is the amount of data being transferred across the shards.

There are 5 types of distributed joins, as explained here, ordered from most preferred to least:

  1. Local/Collocated Reference Table Join

This is the example you mentioned with the Countries table. Each node keeps a copy of this table locally because it's rarely updated. The pros and cons are obvious: not everything can be marked as a reference table, but there is no data movement involved.

  1. Local/Collocated Distributed Table Join

This means sending a query to all nodes where the data required for the join is located. Then, the overall execution result is aggregated. The con is that the tables need to be sharded on the columns involved in the join condition. This is the most scalable algorithm as it involves no data movement before doing the join.

This would work with your Friends table example. Presumably, your Users table will be keyed by the User id, which is also the shard key, so there will be an index, so queries should be fast.

  1. Remote Distributed Table Join

All nodes in the cluster send the data they have for the two sides of the join to a single node so it can run the join. This type of join only performs well when the number of rows involved in the join are small.

  1. Broadcast Join

If you're doing a join where one of the sides has a large dataset and the other one has a small dataset, this type sends the small dataset to the larger one, and the node with the large dataset does the join locally.

  1. Shuffle Join

This is the most expensive but flexible way of running a distributed join. A lot of data movement is needed as many of the rows involved in the join are copied to other nodes to execute the join.

like image 197
Maria Ines Parnisari Avatar answered Sep 29 '22 05:09

Maria Ines Parnisari