Partitioning for query performance in SQL Server 2008

Tags:

sql-server

I have a scenario in which there is a huge amount of status data about an item. Each item's status is updated every minute, and there will be about 50,000 items in the near future. That works out to roughly 2,232,000,000 rows of data per month. I must keep at least 3 months of data in the main table before archiving older data.

I need to plan for fast queries filtered by a specific item (its ID) and a date range (usually up to one month), e.g. select A, B, C from Table where ItemID = 3000 and Date between '2010-10-01' and '2010-10-31 23:59:59.999'

So my question is: how should I design a partitioning structure to achieve that?

Currently, I'm partitioning based on the item's unique identifier (an int) modulo the number of partitions, so that all partitions are equally distributed. The drawback is that the table needs an additional column to act as the partitioning column for the partition function, i.e. to map each row to its partition, which adds a little extra storage. Also, each partition is mapped to a different filegroup.
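For reference, a minimal sketch of that setup (all table, function, filegroup, and column names here are placeholders, and 8 partitions is just an illustrative count):

-- A persisted computed column maps each row to one of 8 partitions via ItemID modulo 8.
CREATE PARTITION FUNCTION pfStatusMod (int)
AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6);    -- 7 boundaries = 8 partitions (0..7)

-- Each partition is mapped to its own filegroup, as described above.
CREATE PARTITION SCHEME psStatusMod
AS PARTITION pfStatusMod TO (fg0, fg1, fg2, fg3, fg4, fg5, fg6, fg7);

CREATE TABLE ItemStatus (
    ItemID       int      NOT NULL,
    [Date]       datetime NOT NULL,
    A            int      NULL,
    B            int      NULL,
    C            int      NULL,
    PartitionKey AS (ItemID % 8) PERSISTED NOT NULL   -- the extra column mentioned above
) ON psStatusMod (PartitionKey);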

asked Nov 22 '10 by gsb

3 Answers

Partitioning is never done for query performance. With partitioning, performance will always be worse; the best you can hope for is no big regression, never an improvement.

For query performance, anything a partition can do, an index can do better, and that should be your answer: index appropriately.

Partitioning is useful for I/O path control (e.g. distributing data across archive and current volumes) or for fast switch-in/switch-out scenarios in ETL loads. So I would understand it if you had a sliding window and partitioned by date, so that you could quickly switch out the data that no longer needs to be retained, as sketched below.
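A rough sketch of that sliding-window switch-out (the table, staging table, and partition function names here are hypothetical):

-- Switch the oldest month out to a staging table: a metadata-only operation,
-- so it completes almost instantly regardless of row count.
ALTER TABLE ItemStatus
SWITCH PARTITION 1 TO ItemStatus_ArchiveStaging;

-- Then remove the now-empty boundary from the partition function.
ALTER PARTITION FUNCTION pfStatusByMonth()
MERGE RANGE ('2010-08-01');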

Another narrow case for partitioning is last-page insert latch contention, as described in Resolving PAGELATCH Contention on Highly Concurrent INSERT Workloads.

Your partition scheme and use case do not seem to fit any of the scenarios in which partitioning would be beneficial (the last scenario is a possible exception, but that isn't clear from your description), so it will most likely hurt performance.

answered Oct 13 '22 by Remus Rusanu


I don't really agree with Remus Rusanu. I think partitioning may improve performance if there is a logical reason for it, related to your use cases. My guess is that you could partition on the itemID ONLY. The alternative would be to partition on the date as well, but if you cannot guarantee that a date range will never cross the boundaries of a given partition (i.e. queries are not guaranteed to stay within a single month), then I would stick to itemID partitioning.

If there are only a few fields you need to compute, another option is a covering index: define an INDEX on your main differentiation field (the itemId) which INCLUDEs the fields you need to compute:

CREATE INDEX idxTest ON [Table] (itemId) INCLUDE (quantity);   -- [Table] and quantity stand in for your actual table and included columns
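Given the sample query in the question (filtering on ItemID and a date range, returning A, B, C), a composite covering index along these lines may fit even better; the table and column names below are simply taken from the question's example:

CREATE INDEX idxItemDate ON [Table] (ItemID, [Date]) INCLUDE (A, B, C);

With ItemID leading and Date second, a query for one item and one month can seek straight to the matching range, and the INCLUDEd columns avoid key lookups back into the base table.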
answered Oct 13 '22 by iDevlop


Applicative partitioning actually CAN be beneficial for query performance. In your case you have 50K items and roughly 2 billion rows. You could, for example, create 500 tables, each named status_nnn where nnn is between 001 and 500, and "partition" your item statuses equally among these tables, where nnn is a function of the item ID. This way, given an item ID, you can limit your search a priori to 0.2% of the whole data (ca. 4M rows).

This approach has a lot of disadvantages, as you'll probably have to deal with dynamic SQL and other unpleasant issues, especially if you need to aggregate data from different tables. BUT it will definitely improve performance for certain queries, such as the ones you mention.
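As an illustration only, the routing could look roughly like this (the modulo mapping, table layout, and column names are assumptions made for the sketch, not something prescribed by this answer):

DECLARE @ItemID int = 3000;

-- Map the item to its status_nnn table, nnn in 001..500.
DECLARE @TableName sysname =
    'status_' + RIGHT('000' + CAST(@ItemID % 500 + 1 AS varchar(3)), 3);

-- Parameterized dynamic SQL against the chosen table.
DECLARE @sql nvarchar(max) =
    N'SELECT A, B, C FROM ' + QUOTENAME(@TableName) +
    N' WHERE ItemID = @p_ItemID AND [Date] BETWEEN @p_From AND @p_To;';

EXEC sp_executesql @sql,
    N'@p_ItemID int, @p_From datetime, @p_To datetime',
    @p_ItemID = @ItemID,
    @p_From   = '2010-10-01',
    @p_To     = '2010-10-31 23:59:59.997';   -- datetime rounds .999 up to the next day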

Essentially, applicative partitioning is similar to creating a very wide and flat index, optimized for very specific queries, without duplicating the data.

Another benefit of applicative partitioning is that you could in theory (depending on your use case) distribute your data among different databases and even different servers. Again, this depends very much on your specific requirements, but I've seen and worked with huge data sets (billions of rows) where applicative partitioning worked very well.

answered Oct 13 '22 by Manu