Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server 2008 Partitioned Table and Parallelism

My company is moving to SQL Server 2008 R2. We have a table with tons of archive data. Majority of the queries that uses this table employs DateTime value in the where statement. For example:

Query 1

SELECT COUNT(*) 
FROM TableA 
WHERE 
     CreatedDate > '1/5/2010' 
     and CreatedDate < '6/20/2010'  

I'm making the assumption that partitions are created on CreatedDate and each partition is spread out across multiple drives, we have 8 CPUs, and there are 500 million records in the database that are evenly spread out across the dates from 1/1/2008 to 2/24/2011 (38 partitions). This data could also be portioned in to quarters of a year or other time durations, but lets keep the assumptions to months.

In this case I would believe that the 8 CPU's would be utilized, and only the 6 partitions would be queried for dates between 1/5/2010 and 6/20/2010.

Now what if I ran the following query and my assumptions are the same as above.

Query 2

SELECT COUNT(*) 
FROM TableA 
WHERE State = 'Colorado'

Questions?
1. Will all partitions be queried? Yes
2. Will all 8 CPUs be used to execute the query? Yes
3. Will performance be better than querying a table that is not partitoned? Yes
4. Is there anything else I'm missing?
5. How would Partition Index help?

I answer the first 3 questions above, base on my limited knowledge of SQL Server 2008 Partitioned Table & Parallelism. But if my answers are incorrect, can you provide feedback any why I'm incorrect.

Resource:

  • Video: Demo SQL Server 2008 Partitioned Table Parallelism (5 minutes long)
  • MSDN: Partitioned Tables and Indexes
  • MSDN: Designing Partitions to Manage Subsets of Data
  • MSDN: Query Processing Enhancements on Partitioned Tables and Indexes
  • MSDN: Word Doc: Partitioned Table and Index Strategies Using SQL Server 2008 white paper

BarDev

like image 424
Mike Barlow - BarDev Avatar asked Feb 24 '23 23:02

Mike Barlow - BarDev


2 Answers

Partitioning is never an option for improving performance. The best you can hope for is to have on-par performance with non-partitioned table. Usually you get a regression that increases with the number of partitions. For performance you need indexes, not partitions. Partitions are for data management operations: ETL, archival etc. Some claim that partition elimination is possible performance gain, but for anything partition elimination can give placing the leading index key on the same column as the partitioning column will give much better results.

Will all partitions be queried?

That query needs an index on State. Otherwise is a table scan, and will scan the entire table. A table scan over a partitioned table is always slower than a scan over the same size non-partitioned table. The index itself can be aligned on the same partition scheme, but the leading key must be State.

Will all 8 CPUs be used to execute the query?

Parallelism has nothing to do with partitioning, despite the common misconception of the contrary. Both partitioned and non-partitioned range scans can be use a parallel operator, it will be the Query Optimizer decision.

Will performance be better than querying a table that is not partitioned?

No

How would Partition Index help?

An index will help. If the index has to be aligned, then it must be partitioned. A non-partitioned index will be faster than a partitioned one, but the index alignment requirement for switch-in/switch-out operations cannot be circumvented.

If you're looking at partitioning, it should be because you need to do fast switch-in switch-out operations to delete old data past retention policy period or something similar. For performance, you need to look at indexes, not at partitioning.

like image 78
Remus Rusanu Avatar answered Mar 05 '23 17:03

Remus Rusanu


Partitioning can increase performance--I have seen it many times. The reason partitioning was developed was and is performance, especially for inserts. Here is an example from the real world:

I have multiple tables on a SAN with one big ole honking disk as far as we can tell. The SAN administrators insist that the SAN knows all so will not optimize the distribution of data. How can a partition possibly help? Fact: it did and does.

We partitioned multiple tables using the same scheme (FileID%200) with 200 partitions ALL on primary. What use would that be if the only reason to have a partitioning scheme is for "swapping"? None, but the purpose of partitioning is performance. You see, each of those partitions has its own paging scheme. I can write data to all of them at once and there is no possibility of a deadlock. The pages cannot be locked because each writing process has an unique ID that equates to a partition. 200 partitions increased performance 2000x (fact) and deadlocks dropped from 7500 per hour to 3-4 per day. This for the simple reason that page lock escalation always occurs with large amounts of data and a high volume OLTP system and page locks are what cause deadlocks. Partitioning, even on the same volume and file group, places the partitioned data on different pages and lock escalation has no effect since processes are not attempting to access the same pages.

THe benefit is there, but not as great, for selecting data. But typically the partitioning scheme would be developed with the purpose of the DB in mind. I am betting Remus developed his scheme with incremental loading (such as daily loads) rather than transactional processing in mind. Now if one were frequently selecting rows with locking (read committed) then deadlocks could result if processes attempted to access the same page simultaneously.

But Remus is right--in your example I see no benefit, in fact there may be some overhead cost in finding the rows across different partitions.

like image 24
user933581 Avatar answered Mar 05 '23 17:03

user933581