I'm developing a query against a table that contains a bunch of points in a time series. The table can grow quite large, and so I want the query to effectively downsample the output by averaging points over fixed time intervals. After writing the query, I'm surprised by how SQL Server (2008) has opted to execute the query. The execution plan reveals an unnecessary sorting operation that would become expensive as the time series grows. Here is the problem, reduced to a simple example:
CREATE TABLE [dbo].[Example]
(
    [x] FLOAT NOT NULL,
    [y] FLOAT NOT NULL,
    PRIMARY KEY CLUSTERED
    (
        [x] ASC
    )
);
SELECT FLOOR([x]), AVG([y])
FROM [dbo].[Example]
GROUP BY FLOOR([x]);
Here I have (x, y) pairs that are already sorted by x (because of the clustered primary key), and I'm averaging y for each whole-number x (by truncating with the FLOOR function). I would expect the table to already be suitably sorted for the aggregate, since FLOOR is a monotonic function. Unfortunately, SQL Server decides that this data needs to be re-sorted: the execution plan contains a Sort operator ahead of the aggregate.
Shouldn't SQL Server be able to perform a streaming aggregation over data grouped by a monotonic function of columns that are already suitably sorted?
Is there a general way to rewrite such queries so that SQL Server will see that the order is preserved?
[Update] I've found an article on the subject, "Things SQL needs: sargability of monotonic functions", and, as the title suggests, this seems to be an optimization that SQL Server doesn't yet perform (in most cases).
Here are even simpler queries over [dbo].[Example] that demonstrate the point:
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY FLOOR([x]) --sort performed in execution plan
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x] --NO sort performed in execution plan
SELECT [x], [y]
FROM [dbo].[Example]
ORDER BY 2*[x]+1 --sort performed in execution plan
For a single addition or multiplication, the query optimizer understands that the data already has the same order (and the same holds when you GROUP BY such expressions). So the optimizer does seem to understand the concept of monotonic functions; it just doesn't apply it generally.
I'm testing the computed column / index solution now, but it seems like this will dramatically increase the size of the persisted data, since I will need several indexes to cover the range of possible intervals (see the sketch below).
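For concreteness, this is the kind of thing I'm testing (the column and index names here are made up; each interval size needs its own persisted column and index, which is where the extra storage goes):
ALTER TABLE [dbo].[Example] ADD [Floor1]  AS FLOOR([x])      PERSISTED;
ALTER TABLE [dbo].[Example] ADD [Floor5]  AS FLOOR([x] / 5)  PERSISTED;
ALTER TABLE [dbo].[Example] ADD [Floor60] AS FLOOR([x] / 60) PERSISTED;

CREATE INDEX IX_Example_Floor1  ON [dbo].[Example] ([Floor1])  INCLUDE ([y]);
CREATE INDEX IX_Example_Floor5  ON [dbo].[Example] ([Floor5])  INCLUDE ([y]);
CREATE INDEX IX_Example_Floor60 ON [dbo].[Example] ([Floor60]) INCLUDE ([y]);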
I think you will have the best query performance if you do something like this:
create table Point
(
    PointId int identity(1, 1)
        constraint PK_Example_Id primary key,
    X float not null,
    Y float not null,
    FloorX as floor(x) persisted
)
create index IX_Point_FloorX_Y on Point(FloorX, Y)
Add some rows:
declare @RowCount int = 10000

while (@RowCount > 0)
begin
    insert Point (X, Y)
    values (cast(crypt_gen_random(2) as int), cast(crypt_gen_random(2) as int))

    set @RowCount -= 1
end
Query:
select floor(X), avg(Y)
from Point
group by floor(X)
or
select FloorX, avg(Y)
from Point
group by FloorX
Both queries produce the same plan, and the plan contains no Sort operator.
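If you want to verify the absence of a Sort yourself, one option (just a sketch; any plan-inspection method works) is SHOWPLAN_TEXT, which must be the only statement in its batch:
set showplan_text on
go

select FloorX, avg(Y)
from Point
group by FloorX
go

set showplan_text off
go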
Another option: you can create an indexed view. In this case you will have to query the view directly, unless you have Enterprise Edition, which can use the indexed view's index even when you query the table directly.
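A sketch of what that could look like (untested; indexed views cannot contain AVG, so you store SUM and COUNT_BIG and divide at query time, and the index key must be a precise type, hence the cast away from float):
create view dbo.PointSummary
with schemabinding
as
select cast(floor(X) as bigint) as FloorX,  -- precise type so it can be an index key
       sum(Y) as SumY,
       count_big(*) as Cnt                  -- COUNT_BIG(*) is required in an indexed view with GROUP BY
from dbo.Point
group by cast(floor(X) as bigint)
go

create unique clustered index IX_PointSummary_FloorX on dbo.PointSummary (FloorX)
go

-- on editions without automatic view matching, reference the view with NOEXPAND:
select FloorX, SumY / Cnt as AvgY
from dbo.PointSummary with (noexpand)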
[Edit] Just realized I didn't explicitly answer your question. You asked why SQL Server would perform a sort when X is the clustered primary key. SQL Server does not sort on X; it sorts on floor(X). In other words, if x is already sorted, f(x) would not necessarily have the same order, right? (Ordering by -x, for example, reverses the order.)
SQL Server almost always ignores indexes when there is any function on the columns of the index. There are good reasons:
Some optimisations are coded into the QO (for example, COUNT vs EXISTS in an IF; see the sketch below), but it doesn't do rigorous mathematical proofs: those wouldn't pay for themselves in query response times.
There is an MS Connect item asking for this for some datetime functions too (which I actually disagree with, because there are too many permutations of functions to optimise away, so we'd end up with inconsistency).
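To illustrate the COUNT-vs-EXISTS point (a sketch against the question's table; the optimizer typically recognises that only existence matters, so both forms tend to get the same plan):
-- typically compiled as an existence probe, not a full count
IF (SELECT COUNT(*) FROM [dbo].[Example] WHERE [x] > 100) > 0
    PRINT 'rows exist';

IF EXISTS (SELECT * FROM [dbo].[Example] WHERE [x] > 100)
    PRINT 'rows exist';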
Otherwise, the indexed computed column solution from Alex Aza is what I'd do.
Edit:
I read the link in your updated question.
FLOOR changes strictly monotonic to merely monotonic. That is, x is unique, so it is strictly monotonic; FLOOR(x) is only monotonic, because distinct values of x can map to the same FLOOR(x).
If you have any WHERE clauses, statistics become important: as you said, you posted simplified examples.
And for the 2*[x] + 1 example you posted: at what point should SQL Server stop evaluating expressions? It's a cost-based optimiser, after all.
I think it's fair that SQL Server behaves this way: day to day, my EXISTS optimisation example is far more useful.
This is a very good question. In cases like this, we want another table and CROSS APPLY, as in the following example. It uses a Numbers table that stores all the integers between MIN(x)/YourStepInMinutes and MAX(x)/YourStepInMinutes, plus a couple more numbers on either side of that range. The query runs as nested loops and does not need a sort:
-- YourStepInMinutes is a placeholder for your chosen interval size
SELECT n.n, AVG(p.y)
FROM dbo.Numbers AS n
CROSS APPLY (SELECT p.y
             FROM dbo.Points AS p
             WHERE p.x < n.n * YourStepInMinutes
               AND (n.n - 1) * YourStepInMinutes <= p.x
            ) AS p
GROUP BY n.n
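The Numbers table itself only needs to be built once. One possible way to populate it (a sketch with an illustrative row count; size it to cover MIN(x)/YourStepInMinutes through MAX(x)/YourStepInMinutes plus a margin):
SELECT TOP (100000)
       ISNULL(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 0) AS n  -- ISNULL forces NOT NULL so n can be a primary key
INTO dbo.Numbers
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

ALTER TABLE dbo.Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY (n);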
Edit: While this solution does need a join, which is not free, I would not make a blanket statement that it is always slow. Sorting a lot of data may all of a sudden become very slow: sort just 10% more rows and the sort may be ten times slower (for example, once it no longer fits in memory). This approach, on the other hand, scales better, because it does not require one big sort.
Also, because we don't need a persisted computed column, we can immediately use this query for intervals of any size, such as 17 minutes.