Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

When to use horizontal partitioning and when to use database sharding?

I'm reading this article on Wikipedia: http://en.wikipedia.org/wiki/Shard_(database_architecture) trying to find the major difference between these 2 techniques. Here is what I found:

Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.

Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.

As I understood, horizontal partitioning is more applicable for single instance (single node environment) whereas sharding is used in multi-node / multiple data-center environment. Is this correct? Or are there different usage scenarios ?

Extra question: For a huge table (with millions of rows) with simple schema (about 4-5 columns) , what is the best technique to improve read/write performance on this table ?

like image 283
Dio Phung Avatar asked Jun 01 '14 08:06

Dio Phung


People also ask

Is sharding the same as horizontal partitioning?

Sharding is actually a type of database partitioning, more specifically, Horizontal Partitioning. Sharding, is replicating [copying] the schema, and then dividing the data based on a shard key onto a separate database server instance, to spread load. Every distributed table has exactly one shard key.

When would you recommend sharding a database?

For an application with a large, monolithic database, queries can become prohibitively slow. By sharding one table into multiple, though, queries have to go over fewer rows and their result sets are returned much more quickly.

What is the difference between sharding and partitioning?

Sharding and partitioning are both about breaking up a large data set into smaller subsets. The difference is that sharding implies the data is spread across multiple computers while partitioning does not. Partitioning is about grouping subsets of data within a single database instance.

When would horizontal partitioning make the most sense?

If there is a group of rows that are accessed together significantly more frequently than the rest of the rows in a table, then horizontal partitioning may make sense. Horizontal Partitioning: When you draw a horizontal line, the table's records are split into two identical tables.


1 Answers

You are correct, horizontal partition (supported for example in MySQL and PostgreSQL) splits a table up within a single server. This can improve performance because data and indexes can be split across many disk volumes, improving I/O. This is usually done with a key range.

With database sharding, you are dividing the data across multiple servers, not just within a single server. In this case you use a shard key to partition the data, typically with some sort of hashing algorithm. You can get a white paper on this subject here (offered by our company, it is not specific to any product, it explains the technology): http://www.codefutures.com/database-sharding-white-paper/

The advantage of DBMS single server partitioning is that it is relatively simple to set up and manage. The disadvantage is ultimately you are limited by what a single server can do. This is particularly the case when it comes to heavy write contention, database locking and heavy queries.

Database Sharding takes more work, but has the advantage of being a shared-nothing approach, thus it is fully scalable.

A clear indicator that database sharding is needed is when a single server cannot keep up with write volume. If you have many heavy queries, this also can drive the need for this type of solution.

Having said all of this, if you are talking about "millions" of rows with 4 - 5 columns, and your reads can be well-indexed for fast access, it is doubtful if you need to implement either of these options. When you are talking 100s of millions or billions of rows, with 1000s of users, that is where database scalability is critical.

There is also an informational Web site I am working on regarding database scalability: www.bigdatascalability.com. It includes links to various articles, and will have new content added over time.

like image 50
dbschwartz Avatar answered Oct 09 '22 08:10

dbschwartz