Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does SQL server work out the estimated number of rows?

I'm trying to debug a fairly complex stored procedure that joins across many tabls (10-11). I'm seeing that for a part of the tree the estimated number of rows drasticly differs from the actual number of rows - at its worst SQL server estimates that 1 row will be returned, when in actuality 55,000 rows are returned!

I'm trying to work out why this is - all of my statistics are up-to-date, and I've updated statistics with a FULLSCAN on several tables. I'm not using any user defined functions or table variables. As far as I can see SQL server should be able to exactly estimate how many rows are going to be returned, but it continues to choose a plan which cases it to perform tens of thousands of RDI lookups (when it is expecting to only perform 1 or 2).

What can I do to try and understand why the estimated number of rows is out by so much?

UPDATE: So looking at the plan I've found one node in particular which seems suspicous - its a table scan on a table using the following predecate:

status <> 5
AND [type] = 1
OR [type] = 2

This predicate returns the entire table (630 rows - the table scan itself it NOT the source of the poor performance) however SQL server has the estimated number of rows at just 37. SQL server then goes on to do several nested loops with this onto RDI lookups, index scans and index seeks. Could this be the source of my massive miscalculation? How do I get it to estimate a more sensible number of rows?

like image 881
Justin Avatar asked Sep 25 '09 11:09

Justin


Video Answer


2 Answers

SQL Server splits each index into up to 200 ranges with the following data (from here):

  • RANGE_HI_KEY

    A key value showing the upper boundary of a histogram step.

  • RANGE_ROWS

    Specifies how many rows are inside the range (they are smaller than this RANGE_HI_KEY, but bigger than the previous smaller RANGE_HI_KEY).

  • EQ_ROWS

    Specifies how many rows are exactly equal to RANGE_HI_KEY.

  • AVG_RANGE_ROWS

    Average number of rows per distinct value inside the range.

  • DISTINCT_RANGE_ROWS

    Specifies how many distinct key values are inside this range (not including the previous key before RANGE_HI_KEY and RANGE_HI_KEY itself);

Usually, most populated values go into RANGE_HI_KEY.

However, they can get into the range and this can lead to the skew in distribution.

Imagine these data (among the others):

Key value Count of rows

1          1
2          1
3          10000
4          1

SQL Server usually builds two ranges: 1 to 3 and 4 to the next populated value, which makes these statistics:

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  AVG_RANGE_ROWS  DISTINCT_RANGE_ROWS
3             2           10000    1               2

, which means the when searching for, say, 2, there is but 1 row and it's better to use the index access.

But if 3 goes inside the range, the statistics are these:

RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  AVG_RANGE_ROWS  DISTINCT_RANGE_ROWS
4             10002       1        3334            3

The optimizer thinks there are 3334 rows for the key 2 and index access is too expensive.

like image 154
Quassnoi Avatar answered Oct 13 '22 01:10

Quassnoi


It uses statistics, which it keeps for each index.

(You can also create statistics on non-indexed columns)

To update all your statistics on every table in a Database (WARNING: will take some time on very large databases. Don't do this on Production servers without checking with your DBA...):

exec sp_msforeachtable 'UPDATE STATISTICS ?'

If you don't have a regular scheduled job to rebuild your most active indexes (i.e. lots of INSERTS or DELETES), you should consider rebuilding your indexes (same caveat as above applies):

exec sp_msforeachtable "DBCC DBREINDEX('?')"
  • Statistics Used by the Query Optimizer in Microsoft SQL Server 2008
like image 22
Mitch Wheat Avatar answered Oct 13 '22 01:10

Mitch Wheat