
Multiple Row_Number() Calls in a Single SQL Query

I'm trying to set up some data to calculate multiple medians in SQL Server 2008, but I'm having a performance problem. Right now, I'm using this pattern (there's another example at the bottom). Yes, I'm not using a CTE, but using one wouldn't fix the problem I'm having anyway. The performance is poor because the ROW_NUMBER sub-queries run in serial, not in parallel.

Here's a full example. Below the SQL I explain the problem more.

-- build the example table    

CREATE TABLE #TestMedian (
    StateID INT,
    TimeDimID INT,
    ConstructionStatusID INT,

    PopulationSize BIGINT,
    SquareMiles BIGINT
);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 200000, 300000);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 300000, 400000);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 100000, 200000);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 250000, 300000);

INSERT INTO #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize, SquareMiles)
VALUES (1, 1, 1, 350000, 400000);

-- TRUNCATE TABLE #TestMedian

    SELECT
        StateID
        ,TimeDimID
        ,ConstructionStatusID
        ,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
        ,PopulationSizeRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY PopulationSize)
        ,SquareMilesRowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID ORDER BY SquareMiles)
        ,PopulationSize
        ,SquareMiles
    INTO #MedianData
    FROM #TestMedian

    SELECT MinRowNum = MIN(PopulationSizeRowNum), MaxRowNum = MAX(PopulationSizeRowNum), StateID, TimeDimID, ConstructionStatusID, MedianPopulationSize= AVG(PopulationSize) 
    FROM #MedianData T
    WHERE PopulationSizeRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID

    SELECT MinRowNum = MIN(SquareMilesRowNum), MaxRowNum = MAX(SquareMilesRowNum), StateID, TimeDimID, ConstructionStatusID, MedianSquareMiles= AVG(SquareMiles) 
    FROM #MedianData T
    WHERE SquareMilesRowNum IN((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
    GROUP BY StateID, TimeDimID, ConstructionStatusID


    DROP TABLE #MedianData
    DROP TABLE #TestMedian
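
As a side note on the WHERE filters above: they rely on integer division to pick the middle row (odd counts) or the two middle rows (even counts). A minimal sketch of that arithmetic, using a throwaway derived table `T(N)` that is not part of the original query:

```sql
-- N stands in for NumberOfRows; T(N) is a hypothetical inline table for illustration.
-- Integer division makes both expressions land on the same row when N is odd.
SELECT
    N
    ,LowerRow = (N + 1) / 2
    ,UpperRow = (N + 2) / 2
FROM (VALUES (5), (6)) AS T(N);
-- N = 5 -> LowerRow = 3, UpperRow = 3 (single middle row)
-- N = 6 -> LowerRow = 3, UpperRow = 4 (two middle rows, averaged by AVG)
```

Because PopulationSize and SquareMiles are BIGINT, AVG over the two middle rows also returns a BIGINT, so a fractional median is truncated; casting to a DECIMAL type before averaging would avoid that if it matters.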

The problem with this query is that SQL Server executes both of the "ROW_NUMBER() OVER..." sub-queries in serial, not in parallel. So if I have 10 of these ROW_NUMBER calculations, it'll calculate them one after the other and I get linear growth, which stinks. I'm running this query on an 8-way, 32 GB system, and I would love some parallelism. I'm trying to run this type of query on a 5,000,000-row table.

I can tell it's doing this by looking at the query plan and seeing the Sorts in the same execution path (displaying the query plan's XML wouldn't work very well on SO).

So my question is this: How can I alter this query so that the ROW_NUMBER queries are executed in parallel? Is there a completely different technique I can use to prepare the data for multiple median calculations?

JayRu asked Sep 04 '09 16:09



1 Answer

Each ROW_NUMBER requires the rows to be sorted first. Since your two RNs have different ORDER BY conditions, the query must produce the result, order it for the first RN (it may already be in that order), produce that RN, then order it again for the second RN and produce the second RN. There simply isn't any magic pixie dust that can materialize a row number value without counting where the row is in the required order.
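
One way to work with that constraint, sketched here as an assumption rather than a tested fix, is to compute each median in its own statement so that each statement needs only one ordering, and to give each ordering a supporting index. The index name `IX_TestMedian_Population` and the alias `Ranked` are made up for illustration, and the sketch assumes the #TestMedian table from the question still exists. Whether the optimizer actually drops the Sort depends on the plan it chooses, so check the actual execution plan.

```sql
-- Hypothetical index: key order matches PARTITION BY columns, then the ORDER BY column,
-- so the rows can be read already in the order this ROW_NUMBER needs.
CREATE INDEX IX_TestMedian_Population
    ON #TestMedian (StateID, TimeDimID, ConstructionStatusID, PopulationSize);

-- One measure per statement: only the PopulationSize ordering is required here.
SELECT
    StateID
    ,TimeDimID
    ,ConstructionStatusID
    ,MedianPopulationSize = AVG(PopulationSize)
FROM (
    SELECT
        StateID
        ,TimeDimID
        ,ConstructionStatusID
        ,PopulationSize
        ,NumberOfRows = COUNT(*) OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID)
        ,RowNum = ROW_NUMBER() OVER (PARTITION BY StateID, TimeDimID, ConstructionStatusID
                                     ORDER BY PopulationSize)
    FROM #TestMedian
) AS Ranked
-- Same middle-row filter as in the question.
WHERE RowNum IN ((NumberOfRows + 1) / 2, (NumberOfRows + 2) / 2)
GROUP BY StateID, TimeDimID, ConstructionStatusID;
```

On the six sample rows from the question this returns a MedianPopulationSize of 225000. A second statement of the same shape, with its own index ending in SquareMiles, would cover the other measure. Since the statements are independent, they could also be submitted from separate connections, which is one way to get them running at the same time.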

Remus Rusanu answered Sep 18 '22 23:09