 

What to do instead of SQL joins while scaling horizontally?

What would I use instead of SQL joins when I have a large, complex relational database that has grown too large to fit on a single machine? I've begun sharding the database across many machines, but as a result, I can no longer do joins efficiently.

Any tips?

David Xu asked May 01 '14 12:05




2 Answers

There are many approaches to make this work. The general idea is to shard your data so that related data is grouped together.

As a simple (trivial) example, if you have a Game database, you can shard Player and PlayerGame data by the same key (playerId). If there are other related tables, you can add those too. Think of it as a "shard tree" of related tables. All the data for a given Player is then guaranteed to be in the same shard. You can perform joins within a shard, but you cannot do inner joins across shards.
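To make the "shard tree" idea concrete, here is a minimal sketch. It assumes a simple modulo routing function and uses in-memory SQLite databases as stand-ins for separate machines; `NUM_SHARDS`, `shard_for`, the helper functions, and the column names are all hypothetical, not something from a particular sharding product.

```python
import sqlite3

NUM_SHARDS = 4  # hypothetical shard count


def shard_for(player_id: int) -> int:
    """Route a playerId to a shard index (plain modulo; an assumption)."""
    return player_id % NUM_SHARDS


# One in-memory SQLite database stands in for each shard machine.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    db.execute("CREATE TABLE Player (playerId INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE PlayerGame "
        "(playerId INTEGER, gameInstanceId INTEGER, score INTEGER)"
    )


def add_player(player_id: int, name: str) -> None:
    shards[shard_for(player_id)].execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, name)
    )


def add_player_game(player_id: int, game_instance_id: int, score: int) -> None:
    # The child row is routed by the SAME key as its parent Player row.
    shards[shard_for(player_id)].execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)",
        (player_id, game_instance_id, score),
    )


def games_for_player(player_id: int):
    # Both tables live on one shard, so an ordinary local join works.
    db = shards[shard_for(player_id)]
    return db.execute(
        "SELECT p.name, g.gameInstanceId, g.score "
        "FROM Player p JOIN PlayerGame g ON g.playerId = p.playerId "
        "WHERE p.playerId = ? ORDER BY g.gameInstanceId",
        (player_id,),
    ).fetchall()
```

Because every row in the tree is routed by playerId, `games_for_player` only ever touches one shard. A join between two different players' rows would need data from two shards, which is the case this layout does not cover.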

The other common technique is to replicate Global tables to all shards. These are typically tables that are not updated often but are used in lots of joins.

With these two approaches you can:

  • Join within the Shard Tree (but not a cross-shard inner join, e.g., between 2 players)
  • Join from a sharded table to a Global table at any time
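As an illustration of the second bullet, the sketch below replicates a small Global table to every shard so that a sharded table can join to it locally. The `GameType` table, its columns, and the helper names are assumptions made up for this example, and in-memory SQLite databases again stand in for real shard servers.

```python
import sqlite3

NUM_SHARDS = 2  # hypothetical shard count

shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for db in shards:
    # GameType is the (hypothetical) Global table: small, rarely updated.
    db.execute("CREATE TABLE GameType (gameTypeId INTEGER PRIMARY KEY, title TEXT)")
    # PlayerGame is sharded by playerId.
    db.execute(
        "CREATE TABLE PlayerGame (playerId INTEGER, gameTypeId INTEGER, score INTEGER)"
    )


def write_global(game_type_id: int, title: str) -> None:
    # Every write to a Global table is applied to all shards.
    for db in shards:
        db.execute(
            "INSERT OR REPLACE INTO GameType VALUES (?, ?)", (game_type_id, title)
        )


def record_game(player_id: int, game_type_id: int, score: int) -> None:
    shards[player_id % NUM_SHARDS].execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)", (player_id, game_type_id, score)
    )


def scores_with_titles(player_id: int):
    # The join runs entirely on the player's shard because GameType is local.
    db = shards[player_id % NUM_SHARDS]
    return db.execute(
        "SELECT t.title, g.score "
        "FROM PlayerGame g JOIN GameType t ON t.gameTypeId = g.gameTypeId "
        "WHERE g.playerId = ? ORDER BY g.score",
        (player_id,),
    ).fetchall()
```

The trade-off is write amplification: each change to a Global table costs one write per shard, which is why this works best for tables that change rarely.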

Then the other trick is distributed queries, where you may need to roll up results from many shards (e.g., a count of all Players).
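A rollup like this can be sketched as a scatter-gather: run the same aggregate on every shard, then combine the partial results in the application. The example below is only a sketch under assumptions (a `score` column, modulo routing, in-memory SQLite shards queried one after another); a real cluster would typically query the shards in parallel.

```python
import sqlite3

shards = [sqlite3.connect(":memory:") for _ in range(3)]
for db in shards:
    db.execute("CREATE TABLE Player (playerId INTEGER PRIMARY KEY, score INTEGER)")


def insert_player(player_id: int, score: int) -> None:
    shards[player_id % len(shards)].execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, score)
    )


def count_all_players() -> int:
    # Scatter: each shard counts its own rows. Gather: add the partial counts.
    return sum(
        db.execute("SELECT COUNT(*) FROM Player").fetchone()[0] for db in shards
    )


def average_score():
    # Per-shard averages cannot simply be averaged, so each shard returns
    # a (sum, count) pair and the division happens once at the end.
    total, n = 0, 0
    for db in shards:
        s, c = db.execute(
            "SELECT COALESCE(SUM(score), 0), COUNT(*) FROM Player"
        ).fetchone()
        total += s
        n += c
    return total / n if n else None
```

Counts and sums combine trivially. Aggregates such as averages, or "top N" queries, need the shards to return enough intermediate data for the final merge to be correct.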

Here is a white paper that describes a lot of this in more detail:

http://dbshards.com/dbshards/database-sharding-white-paper/

The key to this type of approach is to understand how you want to query the data. The other answer's suggestion to denormalize some data can also be useful when you have to query it from a different perspective. In that case you write the data in two (or more) formats and partition your shards according to each structure. Using the simple example above again, say you need to query all the Players for a single GameInstance. You could make a separate "shard tree" with GameInstance as the parent and PlayerGame as the child, sharded by GameInstanceId. That query is then efficient too.
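Here is a rough sketch of writing the same PlayerGame row into two shard trees, one keyed by playerId and one by GameInstanceId. Plain dictionaries stand in for the shards, and `NUM_SHARDS` and all function names are hypothetical. Note that the two writes here are not atomic; a real system would need some way to handle a failure between them.

```python
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical shard count

# Two independent shard trees holding the same rows under different keys.
by_player = [defaultdict(list) for _ in range(NUM_SHARDS)]
by_game = [defaultdict(list) for _ in range(NUM_SHARDS)]


def record_player_game(player_id: int, game_instance_id: int, score: int) -> None:
    row = (player_id, game_instance_id, score)
    # Write 1: the tree sharded by playerId.
    by_player[player_id % NUM_SHARDS][player_id].append(row)
    # Write 2: the tree sharded by GameInstanceId (denormalized copy).
    by_game[game_instance_id % NUM_SHARDS][game_instance_id].append(row)


def games_of_player(player_id: int):
    # Single-shard read from the player-keyed tree.
    return by_player[player_id % NUM_SHARDS].get(player_id, [])


def players_in_game(game_instance_id: int):
    # Single-shard read from the game-keyed tree.
    rows = by_game[game_instance_id % NUM_SHARDS].get(game_instance_id, [])
    return sorted(r[0] for r in rows)
```

Each read goes to exactly one shard, at the cost of doubling every write and keeping the two copies consistent.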

The goal is to have as many single-shard operations as you can, because distributed operations are, oddly enough, generally the "evil" of a distributed database cluster.

dbschwartz answered Oct 16 '22 12:10


Depending on the data you are using, you could potentially denormalize it and spread it across different DB nodes. That would make your writes a bit trickier, but would improve read performance.

VHristov answered Oct 16 '22 14:10