Need help understanding this SQL Server behavior
I have a fairly basic query like this:
select x, y, sum(z)
from table
where date between @start and @end
group by x, y
There's a large number of rows involved: the filter condition retrieves 6 million rows out of 16 million total.
The thing I don't understand is: this query is slow and I get a warning about spilling to tempdb. But if I simply replace @start and @end with the same dates as literals, it's much faster and there's no warning about a tempdb spill.
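Concretely, the two versions I'm comparing look like this (the dates below are just placeholders for my real values):

-- slow: local variables, warning about spilling to tempdb
declare @start date = '20190101', @end date = '20190131'
select x, y, sum(z)
from table
where date between @start and @end
group by x, y

-- fast: same dates inlined as literals, no spill warning
select x, y, sum(z)
from table
where date between '20190101' and '20190131'
group by x, y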
My guess is that the tempdb spill is because of cardinality estimates.
It appears that when I'm using variables, the estimates are way off: about 1.45 million rows estimated instead of 6 million.
When I use literals, the estimates are almost exactly correct.
How can I get correct estimates, and avoid tempdb spill, when using variables?
The tempdb spill was caused by the row estimates, which weren't right because I was using local variables.
Here's some background on why local variables throw the estimates off:
SQL Server cardinality estimation can use two types of statistics to guess how many rows will get through a predicate filter: the density vector and the histogram.
If you're not familiar with statistics objects and their density vectors / histograms, it's worth reading up on them before continuing.
When a literal is used, the cardinality estimator can search for that literal in the histogram (the second type of statistics). When a local variable is used, its value isn't known at compile time, when cardinality estimation happens, so the CE has to fall back to the column-wide averages in the density vector (the first type of statistics).
In general, you'll get better estimations using literals because the statistics in the histogram are tailored to the value of the literal, rather than being averaged over the whole column.
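If you want to look at these statistics yourself, DBCC SHOW_STATISTICS shows all of it in one go (the table and statistics object names below are placeholders):

-- Returns three result sets: the stats header, the density vector, and the histogram
DBCC SHOW_STATISTICS ('dbo.YourTable', YourStatsName)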
Example
Case 1: Literal
I'm running the following query on the AdventureWorks2012_Data database:
SELECT *
FROM Sales.SalesOrderDetail
WHERE UnitPriceDiscount = 0
We've got a literal, so the CE will look for the value 0 in the UnitPriceDiscount histogram to figure out how many rows will be returned.
I used some debugging output to see which statistics object is being used, then queried that object to see its contents; here's the histogram:
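If you want to reproduce this yourself, one way is to list the statistics objects that cover UnitPriceDiscount and then dump the histogram of the one you find (on a column with no index it will usually be an auto-created statistic with a _WA_Sys_ name; the name in the DBCC call below is a placeholder for whatever the first query returns):

-- Which statistics objects cover UnitPriceDiscount?
SELECT s.name AS stats_name, s.auto_created
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('Sales.SalesOrderDetail')
  AND c.name = 'UnitPriceDiscount'

-- Dump that object's histogram (replace the placeholder with the name returned above)
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', [YourStatsNameHere]) WITH HISTOGRAM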
The value 0 is a RANGE_HI_KEY in that histogram, so the estimated number of rows with that value is its EQ_ROWS column - in this case 117996.9.
Now let's look at the execution plan for the query:
The 'Filter' step is getting rid of all rows that don't match the predicate, so the 'Estimated Number of Rows' section of its properties has the result of the cardinality estimation:
That's the value we saw in the histogram, rounded.
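As a sanity check, you can compare that estimate against the actual row count (assuming the statistics are reasonably current, the two should be close):

-- Actual row count for the predicate; compare with the EQ_ROWS value of 117996.9
SELECT COUNT(*) AS actual_rows
FROM Sales.SalesOrderDetail
WHERE UnitPriceDiscount = 0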
Case 2: Local variable
Now we'll try with a local variable:
DECLARE @temp int = 0
SELECT *
FROM Sales.SalesOrderDetail
WHERE UnitPriceDiscount = @temp
The cardinality estimator doesn't have a literal to search for in the histogram, so it has to use the overall density of the column from the density vector:
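That density is the second result set returned by DBCC SHOW_STATISTICS, and you can pull it out on its own (again, the statistics object name is a placeholder for whatever sys.stats reports on your copy of the database):

-- The 'All density' value for UnitPriceDiscount is what the CE uses here
DBCC SHOW_STATISTICS ('Sales.SalesOrderDetail', [YourStatsNameHere]) WITH DENSITY_VECTOR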
This number is:
1 / the number of distinct values in the UnitPriceDiscount column
So if you multiply that by the number of rows in the table, you get the average number of rows per value in this column. There are 121317 rows in Sales.SalesOrderDetail, so the calculation is:
121317 * 0.1111111 = 13479.6653187
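You can reproduce that number straight from the table (assuming the statistics were built from a full scan, so the stored density matches the live data):

SELECT
    COUNT(*)                                            AS total_rows,         -- 121317
    COUNT(DISTINCT UnitPriceDiscount)                   AS distinct_values,
    1.0 / COUNT(DISTINCT UnitPriceDiscount)             AS density,            -- ~0.1111111
    COUNT(*) * 1.0 / COUNT(DISTINCT UnitPriceDiscount)  AS avg_rows_per_value  -- ~13479.67
FROM Sales.SalesOrderDetail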
The execution plan:
The properties of the filter:
So this time the estimate comes from the density vector, not the histogram.
Let me know if you have a look at the stats object yourself and the numbers don't add up as above.
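As for the original question of how to get better estimates while still using local variables: one common option (worth testing for your workload, since it adds a compile on every execution) is OPTION (RECOMPILE), which lets the optimizer see the variables' runtime values at compile time and use the histogram just as it would for literals. Applied to the query from the question:

select x, y, sum(z)
from table
where date between @start and @end
group by x, y
option (recompile)  -- variable values are visible to the optimizer, so it can use the histogram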