I'm trying to set up some data to calculate multiple medians in SQL Server 2008, but I'm having a performance problem. Right now, I'm using this pattern (another example at the bottom). Yes, I'm not using a CTE, but using one wouldn't fix the problem I'm having anyway: performance is poor because the ROW_NUMBER sub-queries run in serial, not in parallel.
Here's a full example. Below the SQL I explain the problem more.
-- build the example table
CREATE TABLE #TestMedian (
StateID INT,
TimeDimID INT,
ConstructionStatusID INT,
PopulationSize BIGINT,
SquareMiles BIGINT
);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 200000, 300000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 300000, 400000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 250000, 300000);
INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 350000, 400000);
-- TRUNCATE TABLE #TestMedian
SELECT
StateID
,TimeDimID
,ConstructionStatusID
,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
,PopulationSizeRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY PopulationSize)
,SquareMilesRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY SquareMiles)
,PopulationSize
,SquareMiles
INTO #MedianData
FROM #TestMedian
SELECT MinRowNum = MIN(PopulationSizeRowNum), MaxRowNum = MAX(PopulationSizeRowNum), StateID, TimeDimID, ConstructionStatusID, MedianPopulationSize= AVG(PopulationSize)
FROM #MedianData T
WHERE PopulationSizeRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID, TimeDimID, ConstructionStatusID
SELECT MinRowNum = MIN(SquareMilesRowNum), MaxRowNum = MAX(SquareMilesRowNum), StateID, TimeDimID, ConstructionStatusID, MedianSquareMiles= AVG(SquareMiles)
FROM #MedianData T
WHERE SquareMilesRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID, TimeDimID, ConstructionStatusID
DROP TABLE #MedianData
DROP TABLE #TestMedian
The problem with this query is that SQL Server executes both of the "ROW_NUMBER() OVER..." sub-queries in serial, not in parallel. So if I have 10 of these ROW_NUMBER calculations, it'll calculate them one after the other and I get linear growth, which stinks. I'm running this query on an 8-way 32GB system and I would love some parallelism. I'm trying to run this type of query on a 5,000,000 row table.
I can tell it's doing this by looking at the query plan and seeing the Sorts in the same execution path (displaying the query plan's XML wouldn't work very well on SO).
So my question is this: How can I alter this query so that the ROW_NUMBER queries are executed in parallel? Is there a completely different technique I can use to prepare the data for multiple median calculations?
If you'd like to number each row in a result set, SQL provides the ROW_NUMBER() function. This function is used in a SELECT clause alongside other columns. ROW_NUMBER() is followed by an OVER() clause. In standard SQL, if you don't pass any arguments to OVER(), the numbering of rows will not be sorted according to any column. Note that SQL Server requires an ORDER BY inside OVER() for ROW_NUMBER().
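As a minimal sketch of the basic form, the following numbers every row of a small result set by ascending PopulationSize. The table variable `@Sample` and its values are hypothetical and exist only for illustration; the syntax is kept to what SQL Server 2008 accepts.

```sql
-- Hypothetical sample data; the table variable and its values are illustrative only.
DECLARE @Sample TABLE (StateID INT, PopulationSize BIGINT);

INSERT INTO @Sample (StateID, PopulationSize)
VALUES (1, 300000), (1, 100000), (2, 200000);

-- Number all rows as one group, in ascending PopulationSize order.
-- SQL Server requires the ORDER BY inside OVER() for ROW_NUMBER().
SELECT
    RowNum = ROW_NUMBER() OVER (ORDER BY PopulationSize),
    StateID,
    PopulationSize
FROM @Sample;
```

Because there is no PARTITION BY, the whole result set is treated as a single group and the numbering runs from 1 to the row count.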
PARTITION BY is an optional clause of the ROW_NUMBER function. It divides the result set into partitions (groups of rows). ROW_NUMBER() is then applied to each partition separately, so the numbering restarts at 1 in each partition.
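To illustrate the effect of PARTITION BY, here is a small sketch in which the numbering restarts for each StateID. Again, `@Sample` and its values are made up for the example and are not taken from the question.

```sql
-- Hypothetical sample data; the table variable and its values are illustrative only.
DECLARE @Sample TABLE (StateID INT, PopulationSize BIGINT);

INSERT INTO @Sample (StateID, PopulationSize)
VALUES (1, 300000), (1, 100000), (2, 200000), (2, 150000);

-- Each StateID forms its own partition, so RowNumInState starts again at 1
-- for StateID 2 instead of continuing from the last number used for StateID 1.
SELECT
    StateID,
    PopulationSize,
    RowNumInState = ROW_NUMBER() OVER (PARTITION BY StateID ORDER BY PopulationSize)
FROM @Sample
ORDER BY StateID, RowNumInState;
```

This is the same shape the question uses, only with a single partitioning column instead of three.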
ROW_NUMBER adds a unique incrementing number to each row of the result set. The order in which the row numbers are applied is determined by the ORDER BY expression. Most of the time, one or more columns are specified in the ORDER BY expression, but it's possible to use more complex expressions or even a sub-query.
In my experience, an aggregate (DISTINCT or GROUP BY) can be quicker than a ROW_NUMBER() approach.
Each ROW_NUMBER requires the rows to be sorted first. Since your two row numbers have different ORDER BY conditions, the query must produce the result, sort it for the first row number (it may already be in that order), compute that row number, then sort it again for the second row number and compute that one. There simply isn't any magic pixie dust that can materialize a row number value without counting where the row is in the required order.
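Since each ordering needs its own sort, one option worth trying is to give every median its own statement, so that each statement carries only one ROW_NUMBER and one ordering requirement. The sketch below assumes this layout; `#Demo` is a hypothetical stand-in for the question's table, reduced to one grouping column, and whether this is actually faster on a 5,000,000 row table is something you would have to measure.

```sql
-- Assumption: #Demo is a small stand-in for the question's #TestMedian table.
CREATE TABLE #Demo (StateID INT, PopulationSize BIGINT, SquareMiles BIGINT);

INSERT INTO #Demo (StateID, PopulationSize, SquareMiles)
VALUES (1, 100000, 200000), (1, 200000, 300000), (1, 300000, 400000),
       (1, 100000, 200000), (1, 250000, 300000), (1, 350000, 400000);

-- Median of PopulationSize: a single ROW_NUMBER, hence a single sort order.
SELECT StateID, MedianPopulationSize = AVG(PopulationSize)
FROM (
    SELECT StateID,
           PopulationSize,
           RowNum       = ROW_NUMBER() OVER (PARTITION BY StateID ORDER BY PopulationSize),
           NumberOfRows = COUNT(*)     OVER (PARTITION BY StateID)
    FROM #Demo
) AS P
-- Same middle-row test as in the question: one row for odd counts, two for even.
WHERE RowNum IN ((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID;   -- 225000 for this sample (average of 200000 and 250000)

-- Median of SquareMiles: a separate statement with its own single sort order.
SELECT StateID, MedianSquareMiles = AVG(SquareMiles)
FROM (
    SELECT StateID,
           SquareMiles,
           RowNum       = ROW_NUMBER() OVER (PARTITION BY StateID ORDER BY SquareMiles),
           NumberOfRows = COUNT(*)     OVER (PARTITION BY StateID)
    FROM #Demo
) AS S
WHERE RowNum IN ((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID;   -- 300000 for this sample (both middle rows are 300000)

DROP TABLE #Demo;
```

This does not remove any sorting; it only separates the sorts so that the optimizer plans each one independently. Running the statements at the same time would additionally require separate sessions and a table visible to all of them (a local temp table like `#Demo` is not), which is an assumption this sketch does not test.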