
Redundant Sorting for Aggregate Grouped-By Monotonic Function

I'm developing a query against a table that contains a bunch of points in a time series. The table can grow quite large, so I want the query to effectively downsample the output by averaging points over fixed time intervals. After writing the query, I was surprised by how SQL Server (2008) opted to execute it: the execution plan reveals an unnecessary sort operation that would become expensive as the time series grows. Here is the problem, reduced to a simple example:

CREATE TABLE [dbo].[Example]
(
    [x] FLOAT NOT NULL,
    [y] FLOAT NOT NULL,
    PRIMARY KEY CLUSTERED 
    (
        [x] ASC
    )
);

SELECT FLOOR([x]), AVG([y])
FROM [dbo].[Example]
GROUP BY FLOOR([x]);

Here I have (x,y) pairs that are already sorted by x (because of the clustered primary key), and I'm averaging y for each whole number x (by truncating with the FLOOR function). I would expect that the table is already suitably sorted for the aggregate since FLOOR is a monotonic function. Unfortunately, SQL Server decides that this data needs to be re-sorted, and here is the execution plan:

[Image: example execution plan]

Shouldn't SQL Server be able to perform a streaming aggregation over data grouped by a monotonic function of columns that are already suitably sorted?

Is there a general way to rewrite such queries so that SQL Server will see that the order is preserved?

[Update] I've found an article on the subject, "Things SQL needs: sargability of monotonic functions", and, as the title suggests, it seems this is an optimization that SQL Server doesn't yet do (in most cases).

Here are even simpler queries over [dbo].[Example] that demonstrate the point:

SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY FLOOR([x]) --sort performed in execution plan

SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x] --NO sort performed in execution plan

SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x]+1 --sort performed in execution plan

For a single addition or multiplication, the query optimizer understands that the data already has the same order (and the same is seen when you group by such expressions). So the optimizer does understand the concept of monotonic functions; it just doesn't apply it generally.

I'm testing the computed column / index solution now, but it seems like this will dramatically increase the size of the persisted data since I will need several indices to cover the range of possible intervals.
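
For example, covering just two interval sizes would already mean something like this (hypothetical column and index names):

-- One persisted computed column and one covering index per interval size:
ALTER TABLE [dbo].[Example] ADD [Floor1] AS FLOOR([x]) PERSISTED;
ALTER TABLE [dbo].[Example] ADD [Floor5] AS FLOOR([x] / 5) PERSISTED;

CREATE INDEX [IX_Example_Floor1] ON [dbo].[Example] ([Floor1]) INCLUDE ([y]);
CREATE INDEX [IX_Example_Floor5] ON [dbo].[Example] ([Floor5]) INCLUDE ([y]);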

asked Jun 11 '11 by Michael Petito


3 Answers

Some notes:

  • the plan you see when the table is empty and the plan when the table has many rows can be completely different plans
  • I don't think it is correct to have a primary key on the X column: can there be two points with the same X value?

I think you will have the best query performance if you do something like this:

create table Point
(
    PointId int identity(1, 1)
        constraint PK_Example_Id primary key,
    X float not null,
    Y float not null,
    FloorX as floor(x) persisted  -- computed column; persisted, so it can be indexed
)

create index IX_Point_FloorX_Y on Point(FloorX, Y)

Add some rows:

declare @RowCount int = 10000
while(@RowCount > 0)
begin
    insert Point
    values (cast(crypt_gen_random(2) as int), cast(crypt_gen_random(2) as int)) -- two random bytes cast to int: values in [0, 65535]
    set @RowCount -= 1
end

Query:

select floor(X), avg(Y)
from Point
group by floor(X)

or

select FloorX, avg(Y)
from Point
group by FloorX

Both queries will produce the same plan.

Plan: no sorting


Another option: you can create an indexed view. In this case you will have to query the view directly, unless you have Enterprise Edition, whose optimizer can use the view's index even when you query the table directly.
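
Roughly, the view would look something like the following. This is an untested sketch against the Point table above, with hypothetical names: indexed views require SCHEMABINDING and COUNT_BIG(*) and disallow AVG, so you store SUM and COUNT_BIG and divide at query time; the view's index key must also be precise, so the sketch assumes FloorX is defined as cast(floor(X) as int) persisted rather than plain floor(x).

create view dbo.PointByInterval
with schemabinding
as
select
    FloorX,                 -- assumed: cast(floor(X) as int) persisted
    sum(Y) as SumY,         -- AVG is not allowed in an indexed view
    count_big(*) as CountY  -- required when the view has GROUP BY
from dbo.Point
group by FloorX
go

create unique clustered index IX_PointByInterval on dbo.PointByInterval(FloorX)
go

-- On editions below Enterprise, reference the view with NOEXPAND to use its index:
select FloorX, SumY / CountY as AvgY
from dbo.PointByInterval with (noexpand)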

[Edit] Just realized I didn't explicitly answer your question. You asked why SQL Server would perform a sort if X is the clustered primary key. It does not sort on X; it sorts on floor(x). In other words, if x is already sorted, f(x) does not necessarily have the same order, right?

answered Nov 16 '22 by Alex Aza


SQL Server almost always ignores indexes when any function is applied to the indexed columns. There are good reasons:

  • The Query Optimiser (QO) uses statistics of data distribution.
    The function changes this: do you expect statistics to be generated per query?
  • The function (in this case, definitely) can invalidate the uniqueness of the index.
    The QO can use uniqueness in its plan generation.

Some optimisations are coded into the QO (for example: COUNT vs EXISTS in an IF), but it doesn't do rigorous mathematical proofs: such proofs wouldn't pay for themselves in query response time.

There is an MS Connect request for some datetime functions too (which I actually disagree with, because there are too many permutations of functions to optimise away: we'd end up with inconsistency).

Otherwise, the indexed computed column solution from Alex Aza is what I'd do

Edit:

Read your link in updated question.

FLOOR changes strictly monotonic to merely monotonic. That is, x is unique and therefore strictly monotonic; FLOOR(x) is only monotonic, because it introduces ties.
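
A tiny illustration of the difference, with made-up values:

-- x strictly increasing; floor(x) merely non-decreasing (ties appear)
select x, floor(x) as fx
from (values (1.2), (1.5), (2.1)) as v(x)
-- x    fx
-- 1.2  1
-- 1.5  1
-- 2.1  2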

If you have any WHERE clauses, statistics become important: as you said, you posted simplified examples.

And for the 2*[x]+1 example you posted: at what point should SQL Server stop evaluating expressions? It's a cost-based optimiser, after all.

I think it's fair that SQL Server behaves this way: day to day, my EXISTS optimisation example is far more useful.

answered Nov 16 '22 by gbn


This is a very good question. In cases like this we want to have another table and use CROSS APPLY, as in the following example. It uses a table Numbers that stores every integer between MIN(x)/YourStepInMinutes and MAX(x)/YourStepInMinutes, plus one more number on each side of that range. This query runs as nested loops and does not need a sort:

SELECT n.n, AVG(p.y)
FROM dbo.Numbers AS n
CROSS APPLY (SELECT pt.y
  FROM dbo.Points AS pt
  WHERE pt.x < n.n * YourStepInMinutes
    AND (n.n - 1) * YourStepInMinutes <= pt.x
) AS p
GROUP BY n.n
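
For completeness, here is one hypothetical way to create and fill dbo.Numbers to cover the data's actual range (a sketch; the table, the column n, and the step value are all assumptions):

CREATE TABLE dbo.Numbers (n INT NOT NULL PRIMARY KEY);

DECLARE @Step FLOAT = 1.0;  -- YourStepInMinutes
DECLARE @MinN INT, @MaxN INT;

-- One extra number below the minimum and one above the maximum, as described above
SELECT @MinN = FLOOR(MIN(x) / @Step) - 1,
       @MaxN = CEILING(MAX(x) / @Step) + 1
FROM dbo.Points;

INSERT dbo.Numbers (n)
SELECT TOP (@MaxN - @MinN + 1)
       @MinN - 1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns;  -- any row source with enough rows; cross join it for wider ranges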

Edit: While this solution does need a join, which is not free, I would not make a blanket statement that it is always slow. Sorting a lot of data may all of a sudden become very slow: sort 10% more rows, and the sort may take 10 times longer. On the other hand, this approach scales better, because it does not require one big sort.

Also, because we don't need a persisted computed column, we can immediately use this query for intervals of any size, such as 17 minutes.

answered Nov 16 '22 by A-K