I'm trying to debug a fairly complex stored procedure that joins across many tables (10-11). I'm seeing that for a part of the tree the estimated number of rows drastically differs from the actual number of rows - at its worst SQL Server estimates that 1 row will be returned, when in actuality 55,000 rows are returned!
I'm trying to work out why this is - all of my statistics are up to date, and I've updated statistics with a FULLSCAN on several tables. I'm not using any user-defined functions or table variables. As far as I can see SQL Server should be able to estimate exactly how many rows are going to be returned, but it continues to choose a plan which causes it to perform tens of thousands of RID lookups (when it is expecting to only perform 1 or 2).
What can I do to try and understand why the estimated number of rows is out by so much?
UPDATE: So looking at the plan I've found one node in particular which seems suspicious - it's a table scan on a table using the following predicate:
status <> 5
AND [type] = 1
OR [type] = 2
This predicate returns the entire table (630 rows - the table scan itself is NOT the source of the poor performance), however SQL Server has the estimated number of rows at just 37. SQL Server then goes on to do several nested loops with this onto RID lookups, index scans and index seeks. Could this be the source of my massive miscalculation? How do I get it to estimate a more sensible number of rows?
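For reference, this is roughly how I've been comparing the per-operator estimates against the actuals (the table name below is just a placeholder for the 630-row table) - SET STATISTICS PROFILE ON returns the executed plan with both the Rows and EstimateRows columns:

-- Show actual vs. estimated rows for each operator after the query runs.
SET STATISTICS PROFILE ON;

SELECT *
FROM dbo.SuspectTable        -- placeholder name for the 630-row table
WHERE status <> 5
  AND [type] = 1
   OR [type] = 2;

SET STATISTICS PROFILE OFF;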
SQL Server splits each index into up to 200 ranges with the following data (from here):
RANGE_HI_KEY: A key value showing the upper boundary of a histogram step.
RANGE_ROWS: Specifies how many rows are inside the range (they are smaller than this RANGE_HI_KEY, but bigger than the previous, smaller RANGE_HI_KEY).
EQ_ROWS: Specifies how many rows are exactly equal to RANGE_HI_KEY.
AVG_RANGE_ROWS: Average number of rows per distinct value inside the range.
DISTINCT_RANGE_ROWS: Specifies how many distinct key values are inside this range (not including the previous key before RANGE_HI_KEY and RANGE_HI_KEY itself).
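You can look at this histogram yourself with DBCC SHOW_STATISTICS; a minimal sketch, assuming a table dbo.MyTable and a statistics object (or index) named IX_MyTable_Key - both are placeholder names:

-- Dump the histogram steps (RANGE_HI_KEY, RANGE_ROWS, EQ_ROWS, ...) for one statistics object.
-- dbo.MyTable and IX_MyTable_Key are placeholders for your own table and index/statistics names.
DBCC SHOW_STATISTICS ('dbo.MyTable', 'IX_MyTable_Key') WITH HISTOGRAM;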
Usually, the most populated values go into RANGE_HI_KEY steps of their own. However, they can also fall inside a range, and this can lead to a skew in the distribution.
Imagine this data (among other values):
Key value    Count of rows
1            1
2            1
3            10000
4            1
SQL Server usually builds two ranges: 1 to 3 and 4 to the next populated value, which makes these statistics:
RANGE_HI_KEY    RANGE_ROWS    EQ_ROWS    AVG_RANGE_ROWS    DISTINCT_RANGE_ROWS
3               2             10000      1                 2
This means that when searching for, say, 2, there is just 1 row and it's better to use index access.
But if 3 goes inside the range, the statistics are these:
RANGE_HI_KEY    RANGE_ROWS    EQ_ROWS    AVG_RANGE_ROWS    DISTINCT_RANGE_ROWS
4               10002         1          3334              3
The optimizer thinks there are 3334 rows for the key 2 and index access is too expensive.
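To see these histogram columns on real data, here is a small sketch that loads the distribution from the example and then dumps the histogram. All object names are invented for the example, and note that with only four distinct keys SQL Server will normally give every value its own step, so this illustrates the mechanics rather than the mis-estimate itself:

-- Reproduce the example distribution: keys 1, 2 and 4 have one row each, key 3 has 10,000 rows.
CREATE TABLE dbo.SkewDemo (KeyValue int NOT NULL, Payload char(100) NOT NULL DEFAULT '');

INSERT INTO dbo.SkewDemo (KeyValue) VALUES (1), (2), (4);

INSERT INTO dbo.SkewDemo (KeyValue)
SELECT TOP (10000) 3
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;

CREATE INDEX IX_SkewDemo_KeyValue ON dbo.SkewDemo (KeyValue);

-- Sample every row so the histogram matches the data exactly.
UPDATE STATISTICS dbo.SkewDemo IX_SkewDemo_KeyValue WITH FULLSCAN;

-- Compare EQ_ROWS and AVG_RANGE_ROWS for the steps that cover keys 2 and 3.
DBCC SHOW_STATISTICS ('dbo.SkewDemo', 'IX_SkewDemo_KeyValue') WITH HISTOGRAM;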
SQL Server uses statistics, which it keeps for each index. (You can also create statistics on non-indexed columns.)
To update the statistics on every table in a database (WARNING: this will take some time on very large databases - don't do this on production servers without checking with your DBA...):
exec sp_msforeachtable 'UPDATE STATISTICS ?'
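If only one or two tables are the problem, a narrower (and much cheaper) variant is to update a single table with a full scan; dbo.MyTable below is a placeholder:

-- Update all statistics on one table, reading every row rather than a sample.
UPDATE STATISTICS dbo.MyTable WITH FULLSCAN;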
If you don't have a regular scheduled job to rebuild your most active indexes (i.e. lots of INSERTS or DELETES), you should consider rebuilding your indexes (same caveat as above applies):
exec sp_msforeachtable "DBCC DBREINDEX('?')"
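On newer versions of SQL Server, DBCC DBREINDEX is deprecated; the per-table equivalent (again, dbo.MyTable is a placeholder) is:

-- Rebuild every index on one table; this also refreshes those indexes' statistics with a full scan.
ALTER INDEX ALL ON dbo.MyTable REBUILD;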