Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is a good size (# of rows) to partition a table to really benefit?

I.E. if we have got a table with 4 million rows.

Which has got a STATUS field that can assume the following value: TO_WORK, BLOCKED or WORKED_CORRECTLY.

Would you partition on a field which will change just one time (most of times from to_work to worked_correctly)? How many partitions would you create?

like image 929
Revious Avatar asked Jul 30 '11 20:07

Revious


2 Answers

The absolute number of rows in a partition is not the most useful metric. What you really want is a column which is stable as the table grows, and which delivers on the potential benefits of partitioning. These are: availability, tablespace management and performance.

For instance, your example column has three values. That means you can have three partitions, which means you can have three tablespaces. So if a tablespace becomes corrupt you lose one third of your data. Has partitioning made your table more available? Not really.

Adding or dropping a partition makes it easier to manage large volumes of data. But are you ever likely to drop all the rows with a status of WORKED_CORRECTLY? Highly unlikely. Has partitioning made your table more manageable? Not really.

The performance benefits of partitioning come from query pruning, where the optimizer can discount chunks of the table immediately. Now each partition has 1.3 million rows. So even if you query on STATUS='WORKED_CORRECTLY' you still have a huge number of records to winnow. And the chances are, any query which doesn't involve STATUS will perform worse than it did against the unpartitioned table. Has partitioning made your table more performant? Probably not.

So far, I have been assuming that your partitions are evenly distributed. But your final question indicates that this is not the case. Most rows - if not all - rows will end up in the WORKED_CORRECTLY. So that partition will become enormous compared to the others, and the chances of benefits from partitioning become even more remote.

Finally, your proposed scheme is not elastic. As the current volume each partition would have 1.3 million rows. When your table grows to forty million rows in total, each partition will hold 13.3 million rows. This is bad.

So, what makes a good candidate for a partition key? One which produces lots of partitions, one where the partitions are roughly equal in size, one where the value of the key is unlikely to change and one where the value has some meaning in the life-cycle of the underlying object, and finally one which is useful in the bulk of queries run against the table.

This is why something like DATE_CREATED is such a popular choice for partitioning of fact tables in data warehouses. It generates a sensible number of partitions across a range of granularities (day, month, or year are the usual choices). We get roughly the same number of records created in a given time span. Data loading and data archiving are usually done on the basis of age (i.e. creation date). BI queries almost invariably include the TIME dimension.

like image 122
APC Avatar answered Nov 15 '22 08:11

APC


The number of rows in a table isn't generally a great metric to use to determine whether and how to partition the table.

What problem are you trying to solve? Are you trying to improve query performance? Performance of data loads? Performance of purging your data?

Assuming you are trying to improve query performance? Do all your queries have predicates on the STATUS column? Are they doing single row lookups of rows? Or would you want your queries to scan an entire partition?

like image 41
Justin Cave Avatar answered Nov 15 '22 10:11

Justin Cave