Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a difference between table distribution and table partition in sql?

I am still struggling with identifying how the concept of table distribution in azure sql data warehouse differs from concept of table partition in Sql server?

Definition of both seems to be achieving same results.

like image 827
Amit Soni Avatar asked Aug 03 '18 17:08

Amit Soni


People also ask

What is the difference between partition and distribution?

The partition coefficient generally refers to the concentration ratio of un-ionized species of compound, whereas the distribution coefficient refers to the concentration ratio of all species of the compound (ionized plus un-ionized). In the chemical and pharmaceutical sciences, both phases usually are solvents.

What is table distribution in SQL?

A distributed table appears as a single table, but the rows are actually stored across 60 distributions. The rows are distributed with a hash or round-robin algorithm. Hash-distribution improves query performance on large fact tables, and is the focus of this article.

What's the difference between a partition and an index in SQL?

Indexes are used to speed the search of data within tables. Partitions provide segregation of the data at the hdfs level, creating sub-directories for each partition. Partitioning allows the number of files read and amount of data searched in a query to be limited.

What are the types of table partitions in SQL Server?

There are two types of Partitioning in SQL Server: Vertical Partitioning. Horizontal partitioning.


1 Answers

Azure DW has up to 60 computing nodes as part of it's MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your tables data is distributed across these nodes (using Hash distribution or Round Robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.

enter image description here

That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.

enter image description here


Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).

Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.

Here's an example from the Microsoft website where horizontal partitioning is done on the name column with two "shards" based on the names alphabetical order:

enter image description here


Table distribution is a concept that is only available on MPP type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced (to a degree) from the data. Azure gives you a lot of control here where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.

In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partition, distribution, etc) are there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs you make your SQL requests CRAZY fast even against monstrously huge data.

like image 176
JNevill Avatar answered Sep 30 '22 04:09

JNevill