What would I use instead of SQL joins when I have a large complex relational-database that just got too large to fit on a single machine? I've begun sharding the database across many machines, but as a result, I can no longer do joins efficiently.
Any tips?
Databases are scaled either vertically (by adding more resources to existing machines) or horizontally (by adding more machines, distributing data, and processing across those machines).
Horizontal Scaling is Difficult
Sharding data across nodes is difficult. Achieving high availability is difficult. Handling distributed transactions is difficult. And synchronizing time across nodes is difficult.
Horizontal scaling (aka sharding) is when you actually split your data into smaller, independent buckets and keep adding new buckets as needed. A sharded environment has two or more groups of MySQL servers that are isolated and independent from each other.
There are many approaches to make this work. The general idea is to shard your data so that related data ends up grouped together on the same shard.
As a simple (trivial) example, if you have a Game database, you can shard Player and PlayerGame data by the same key (playerId). If there are other related tables, you can add those too; think of it as a "shard tree" of related tables. All the data for a given Player is then guaranteed to be in the same shard. You can perform joins within a shard, but you cannot do inner joins across shards.
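To make the "shard tree" idea concrete, here is a minimal sketch. It assumes two shards, simple modulo routing on playerId, and uses in-memory SQLite databases as stand-ins for the real shard servers. The column names other than playerId and the helper names (`shard_for`, `games_for_player`) are made up for illustration.

```python
import sqlite3

NUM_SHARDS = 2  # assumption: a fixed, small number of shards


def make_shard():
    # Each in-memory SQLite database stands in for one shard server.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Player (playerId INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE PlayerGame (playerId INTEGER, gameId INTEGER, score INTEGER);
    """)
    return conn


shards = [make_shard() for _ in range(NUM_SHARDS)]


def shard_for(player_id):
    # Modulo routing is the simplest possible scheme; a lookup table or
    # consistent hashing would make it easier to add shards later.
    return shards[player_id % NUM_SHARDS]


def add_player(player_id, name):
    shard_for(player_id).execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, name))


def add_player_game(player_id, game_id, score):
    # Child rows are routed by the parent's key, so they land next to it.
    shard_for(player_id).execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)", (player_id, game_id, score))


def games_for_player(player_id):
    # An ordinary SQL join, executed entirely inside one shard.
    return shard_for(player_id).execute(
        """SELECT p.name, g.gameId, g.score
           FROM Player p JOIN PlayerGame g ON g.playerId = p.playerId
           WHERE p.playerId = ?
           ORDER BY g.gameId""", (player_id,)).fetchall()


add_player(1, "alice")
add_player(2, "bob")
add_player_game(1, 10, 300)
add_player_game(1, 11, 450)
add_player_game(2, 10, 120)
print(games_for_player(1))
```

Because both tables are routed by the same key, the join never has to leave the shard that owns the player.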
The other common technique is to replicate Global tables to all shards. These are typically tables that are not updated often but are used in lots of joins.
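A rough sketch of the Global table idea, under the same assumptions as before (in-memory SQLite databases as stand-in shards, modulo routing). The GameType table and the function names are hypothetical, picked only to show the pattern: global rows are written to every shard, sharded rows to exactly one.

```python
import sqlite3


def make_shard():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE GameType (gameTypeId INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE PlayerGame (playerId INTEGER, gameTypeId INTEGER, score INTEGER);
    """)
    return conn


shards = [make_shard() for _ in range(2)]


def write_global(game_type_id, title):
    # A global row is copied to every shard. This is affordable only
    # because global tables change rarely.
    for conn in shards:
        conn.execute(
            "INSERT OR REPLACE INTO GameType VALUES (?, ?)",
            (game_type_id, title))


def add_player_game(player_id, game_type_id, score):
    # Sharded rows go to exactly one shard, chosen by playerId.
    shards[player_id % len(shards)].execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)",
        (player_id, game_type_id, score))


def scores_with_titles(player_id):
    # The join against GameType is local because every shard has a copy.
    return shards[player_id % len(shards)].execute(
        """SELECT t.title, g.score
           FROM PlayerGame g JOIN GameType t ON t.gameTypeId = g.gameTypeId
           WHERE g.playerId = ?
           ORDER BY t.title""", (player_id,)).fetchall()


write_global(1, "chess")
write_global(2, "go")
add_player_game(7, 1, 50)
add_player_game(7, 2, 60)
add_player_game(4, 2, 10)
print(scores_with_titles(7))
```

In a real cluster the fan-out write would normally be handled by replication rather than by application code, but the effect on reads is the same.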
With these two approaches you can join related tables within a shard, and join sharded tables against the replicated Global tables.
The other trick is distributed queries, where you need to roll up results from many shards (e.g., a count of all Players).
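A roll-up like that usually means running the same query on every shard and combining the partial results in the application. The sketch below assumes three stand-in shards (in-memory SQLite) and queries them one after another; a real system would more likely fan the queries out in parallel. The function names are hypothetical.

```python
import sqlite3

NUM_SHARDS = 3  # assumption for this example


def make_shard():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Player (playerId INTEGER PRIMARY KEY, name TEXT)")
    return conn


shards = [make_shard() for _ in range(NUM_SHARDS)]


def add_player(player_id, name):
    shards[player_id % NUM_SHARDS].execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, name))


def per_shard_counts():
    # Scatter: run the same aggregate on every shard.
    return [conn.execute("SELECT COUNT(*) FROM Player").fetchone()[0]
            for conn in shards]


def count_players():
    # Gather: combine the partial results. COUNT and SUM combine by
    # addition; an average would need both a sum and a count per shard.
    return sum(per_shard_counts())


for pid in range(1, 6):
    add_player(pid, f"player{pid}")

print(count_players())
```

Every shard has to answer before the total is known, so a query like this is only as fast as the slowest shard.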
Here is a white paper that describes a lot of this in more detail:
http://dbshards.com/dbshards/database-sharding-white-paper/
The key to this type of approach is to understand how you want to query the data. The answer above can also be useful, to de-normalize some data when you have to query it from a different perspective. In that case you need to write the data in two (or more) formats, and partition your shards according to each structure. Again using the simple example above, let's say you need to query all the Players for a single GameInstance. Now you could make a separate "shard tree" with GameInstance as the parent and PlayerGame as the child, sharded by GameInstanceId. Now that query will be efficient too.
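Here is a small sketch of writing the same data in two formats. It assumes two independent sets of stand-in shards (in-memory SQLite), one routed by playerId and one by GameInstanceId, and a hypothetical `record_play` helper that writes every row to both.

```python
import sqlite3

NUM_SHARDS = 2  # assumption: same shard count for both trees


def new_shards():
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE PlayerGame "
            "(playerId INTEGER, gameInstanceId INTEGER, score INTEGER)")
        shards.append(conn)
    return shards


by_player = new_shards()  # shard tree keyed by playerId
by_game = new_shards()    # shard tree keyed by gameInstanceId

INSERT = "INSERT INTO PlayerGame VALUES (?, ?, ?)"


def record_play(player_id, game_instance_id, score):
    row = (player_id, game_instance_id, score)
    # The same row is written twice, once per shard tree. The two writes
    # are not atomic here; a real system needs a plan for partial failure.
    by_player[player_id % NUM_SHARDS].execute(INSERT, row)
    by_game[game_instance_id % NUM_SHARDS].execute(INSERT, row)


def players_in_game(game_instance_id):
    # Single-shard read from the tree keyed by gameInstanceId.
    rows = by_game[game_instance_id % NUM_SHARDS].execute(
        "SELECT playerId FROM PlayerGame WHERE gameInstanceId = ? "
        "ORDER BY playerId", (game_instance_id,)).fetchall()
    return [r[0] for r in rows]


def games_for_player(player_id):
    # Single-shard read from the tree keyed by playerId.
    rows = by_player[player_id % NUM_SHARDS].execute(
        "SELECT gameInstanceId FROM PlayerGame WHERE playerId = ? "
        "ORDER BY gameInstanceId", (player_id,)).fetchall()
    return [r[0] for r in rows]


record_play(1, 100, 5)
record_play(2, 100, 7)
record_play(3, 101, 9)
record_play(1, 101, 2)
print(players_in_game(100), games_for_player(1))
```

Both lookups now touch a single shard. The cost is double the storage and a more complicated write path.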
The goal is to have as many single shard operations as you can, as distributed operations oddly enough are generally the "evil" of a distributed database cluster.
Depending on the data you are using, you could potentially denormalize it and spread it across different DB nodes. That would make your writes a bit more tricky, but would improve read performance.