What to do instead of SQL joins while scaling horizontally?

What would I use instead of SQL joins when I have a large, complex relational database that has grown too large to fit on a single machine? I've begun sharding the database across many machines, but as a result, I can no longer do joins efficiently.

Any tips?

835

asked May 01 '14 12:05

David Xu

2 Answers

There are many approaches to make this work. The general idea is to shard your data in such a way that related data is grouped together.

As a simple (trivial) example, if you have a Game database, you can shard Player and PlayerGame data by the same key (playerId). If other tables are related, you can add those too; think of it as a "shard tree" of related tables. All the data for a given Player is then guaranteed to be in the same shard. You can perform joins within a shard, but you cannot do inner joins across shards.
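To make the "shard tree" idea concrete, here is a minimal sketch in Python. It uses in-memory SQLite databases as stand-ins for separate machines, and the modulo routing, the shard count, and the column names other than playerId are assumptions for illustration, not something a particular sharding product prescribes.

```python
import sqlite3

NUM_SHARDS = 4  # assumed shard count, for illustration only


def shard_for(player_id: int) -> int:
    """Route by the shard key: every row for a player lands on one shard."""
    return player_id % NUM_SHARDS


def make_shard() -> sqlite3.Connection:
    """Create one shard holding both tables of the Player shard tree."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Player (playerId INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE PlayerGame (playerId INTEGER, gameId INTEGER, score INTEGER)"
    )
    return conn


shards = [make_shard() for _ in range(NUM_SHARDS)]


def add_player(player_id: int, name: str) -> None:
    shards[shard_for(player_id)].execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, name)
    )


def add_player_game(player_id: int, game_id: int, score: int) -> None:
    # Child rows use the parent's key, so they follow the Player to its shard.
    shards[shard_for(player_id)].execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)", (player_id, game_id, score)
    )


def games_for_player(player_id: int):
    """An ordinary SQL join, executed entirely inside one shard."""
    conn = shards[shard_for(player_id)]
    return conn.execute(
        "SELECT p.name, pg.gameId, pg.score "
        "FROM Player p JOIN PlayerGame pg ON pg.playerId = p.playerId "
        "WHERE p.playerId = ? ORDER BY pg.gameId",
        (player_id,),
    ).fetchall()


add_player(7, "alice")
add_player_game(7, 100, 42)
add_player_game(7, 101, 17)
```

Because both tables are routed by the same key, games_for_player touches exactly one shard no matter how many shards exist.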

The other common technique is to replicate Global tables to all shards. These are typically tables that are not updated often but are used in lots of joins.

With these two approaches you can:

Join within the Shard Tree (but not a cross-shard inner join, e.g., between 2 players)
Join from a sharded table to a Global table at any time
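A rough sketch of the Global table technique, again with in-memory SQLite databases standing in for shards. The GameType lookup table and its columns are hypothetical names chosen for the example; the point is only that a write to a Global table goes to every shard, so any shard can join against its local copy.

```python
import sqlite3

NUM_SHARDS = 3  # assumed shard count, for illustration only


def make_shard() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    # Sharded table: each shard holds only its own players' rows.
    conn.execute(
        "CREATE TABLE PlayerGame (playerId INTEGER, gameTypeId INTEGER, score INTEGER)"
    )
    # Global table (hypothetical): a full copy lives on every shard.
    conn.execute("CREATE TABLE GameType (gameTypeId INTEGER PRIMARY KEY, label TEXT)")
    return conn


shards = [make_shard() for _ in range(NUM_SHARDS)]


def write_global_game_type(game_type_id: int, label: str) -> None:
    """Writes to a Global table are broadcast to all shards."""
    for conn in shards:
        conn.execute(
            "INSERT OR REPLACE INTO GameType VALUES (?, ?)", (game_type_id, label)
        )


def add_player_game(player_id: int, game_type_id: int, score: int) -> None:
    shards[player_id % NUM_SHARDS].execute(
        "INSERT INTO PlayerGame VALUES (?, ?, ?)", (player_id, game_type_id, score)
    )


def labelled_games(player_id: int):
    """Join a sharded table to the local copy of the Global table."""
    conn = shards[player_id % NUM_SHARDS]
    return conn.execute(
        "SELECT gt.label, pg.score "
        "FROM PlayerGame pg JOIN GameType gt ON gt.gameTypeId = pg.gameTypeId "
        "WHERE pg.playerId = ? ORDER BY pg.score",
        (player_id,),
    ).fetchall()


write_global_game_type(1, "chess")
write_global_game_type(2, "poker")
add_player_game(5, 1, 30)
add_player_game(5, 2, 80)
```

The cost is write amplification: each update to a Global table is repeated once per shard, which is why this suits tables that change rarely.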

Then the other trick is distributed queries, where you may need to roll up results from many shards (e.g., a count of all Players).
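The roll-up for a count of all Players might look something like the sketch below. It queries each shard in turn and sums the partial results in the application; a real cluster would likely fan the queries out in parallel, and the shard setup here (in-memory SQLite, modulo routing) is assumed purely for illustration.

```python
import sqlite3

NUM_SHARDS = 3  # assumed shard count, for illustration only


def make_shard() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Player (playerId INTEGER PRIMARY KEY, name TEXT)")
    return conn


shards = [make_shard() for _ in range(NUM_SHARDS)]


def add_player(player_id: int, name: str) -> None:
    shards[player_id % NUM_SHARDS].execute(
        "INSERT INTO Player VALUES (?, ?)", (player_id, name)
    )


def per_shard_counts():
    """Scatter: run the same aggregate on every shard."""
    return [
        conn.execute("SELECT COUNT(*) FROM Player").fetchone()[0] for conn in shards
    ]


def count_players() -> int:
    """Gather: combine the partial aggregates into one result."""
    return sum(per_shard_counts())


for pid in range(1, 11):
    add_player(pid, f"player{pid}")
```

Counts and sums combine by simple addition. An average does not: it has to be rebuilt from per-shard sums and counts rather than by averaging the per-shard averages.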

Here is a white paper that describes a lot of this in more detail:

http://dbshards.com/dbshards/database-sharding-white-paper/

The key to this type of approach is to understand how you want to query the data. The other answer's suggestion can also be useful: de-normalize some data when you have to query it from a different perspective. In that case you need to write the data in two (or more) formats, and partition your shards according to each structure. Again using the simple example above, let's say you need to query all the Players for a single GameInstance. You could make a separate "shard tree" with GameInstance as the parent and PlayerGame as the child, sharded by GameInstanceId. That query will then be efficient too.
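Writing the same data in two formats could be sketched as follows. Plain Python lists stand in for the two sets of shards, and the function names and modulo routing are hypothetical; the sketch only shows that one logical write becomes two physical writes, each routed by a different key.

```python
NUM_SHARDS = 4  # assumed shard count, for illustration only

# Two independent "shard trees" holding the same PlayerGame rows:
# one partitioned by playerId, the other by gameInstanceId.
by_player = [[] for _ in range(NUM_SHARDS)]
by_game_instance = [[] for _ in range(NUM_SHARDS)]


def record_player_game(player_id: int, game_instance_id: int, score: int) -> None:
    """Dual write: store the row once per partitioning scheme."""
    row = (player_id, game_instance_id, score)
    by_player[player_id % NUM_SHARDS].append(row)
    by_game_instance[game_instance_id % NUM_SHARDS].append(row)


def players_in_game_instance(game_instance_id: int):
    """Single-shard read from the tree sharded by gameInstanceId."""
    shard = by_game_instance[game_instance_id % NUM_SHARDS]
    return sorted(p for (p, g, _score) in shard if g == game_instance_id)


def game_instances_for_player(player_id: int):
    """Single-shard read from the tree sharded by playerId."""
    shard = by_player[player_id % NUM_SHARDS]
    return sorted(g for (p, g, _score) in shard if p == player_id)


record_player_game(1, 900, 10)
record_player_game(2, 900, 20)
record_player_game(6, 901, 5)
```

Both reads stay on a single shard, but the two writes are not atomic across shards, so a real system would need some way to retry or reconcile a write that succeeds on one side only.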

The goal is to have as many single-shard operations as you can, because distributed operations, oddly enough, are generally the "evil" of a distributed database cluster.

146

answered Oct 16 '22 12:10

dbschwartz

Depending on the data you are using, you could potentially denormalize it and spread it across different DB nodes. That would make your writes a bit more tricky, but would improve read performance.

answered Oct 16 '22 14:10

VHristov

Tags: mysql, scalability, sharding
