I was reading about filtered statistics in the article linked below.
http://blogs.msdn.com/b/psssql/archive/2010/09/28/case-of-using-filtered-statistics.aspx
The data is heavily skewed: one region has a single row in Sales, and all the remaining rows belong to the other region. The complete code to reproduce the issue is below:
create table Region(id int, name nvarchar(100))
go
create table Sales(id int, detail int)
go
create clustered index d1 on Region(id)
go
create index ix_Region_name on Region(name)
go
create statistics ix_Region_id_name on Region(id, name)
go
create clustered index ix_Sales_id_detail on Sales(id, detail)
go
-- only two values in this table as lookup or dim table
insert Region values(0, 'Dallas')
insert Region values(1, 'New York')
go
set nocount on
-- Sales is skewed
insert Sales values(0, 0)
declare @i int
set @i = 1
while @i <= 1000 begin
insert Sales values (1, @i)
set @i = @i + 1
end
go
update statistics Region with fullscan
update statistics Sales with fullscan
go
set statistics profile on
go
-- note that this query will overestimate:
-- it estimates 500.5 rows
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)
-- this query will underestimate:
-- it also estimates 500.5 rows, but in fact 1000 rows are returned
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go
set statistics profile off
go
create statistics Region_stats_id on Region (id)
where name = 'Dallas'
go
create statistics Region_stats_id2 on Region (id)
where name = 'New York'
go
set statistics profile on
go
-- now the estimate becomes accurate (1 row) because stats Region_stats_id is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)
--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
go
set statistics profile off
My question: the following statistics are available on the two tables.
exec sp_helpstats 'region','all'
exec sp_helpstats 'sales','all'
Table region:
statistics_name      statistics_keys
d1                   id
ix_Region_id_name    id, name
ix_Region_name       name
Table sales:
statistics_name      statistics_keys
ix_Sales_id_detail   id, detail
1. Why was the estimate wrong for the queries below?
select detail from Region join Sales on Region.id = Sales.id where name='Dallas' option (recompile)
--the estimate becomes accurate (1000 rows) because stats Region_stats_id2 is used to evaluate
select detail from Region join Sales on Region.id = Sales.id where name='New York' option (recompile)
2. When I created the filtered statistics as the author describes, the estimates were correct. But why do we need filtered statistics here? How can I tell that my queries need filtered statistics, given that I got the same result even after creating plain statistics?
The best resources I have come across so far:
1. Kimberly Tripp's skewed statistics video
2. TechNet statistics whitepaper
But I still cannot understand why filtered statistics made a difference here.
Thanks in advance.
Update (7/4):
Rephrasing the question after Martin's and James's answers:
1. Is there any way to avoid data skew? Other than Kimberly's script, one more way to estimate it is to count the number of rows for each value.
2. Have you faced any issues with data skew in your experience? I assume it matters mostly for large tables, but I am looking for a detailed answer.
3. Updating statistics costs the IO of scanning the table, and sometimes causes blocking for a query that runs while the update is triggered. Do you see any other overhead in maintaining statistics?
The reason I ask is that I am thinking of creating filtered statistics for several conditions, based partly on DTA input.
Thanks again.
I assume this is why it happens: you get the same estimate (500.5 rows) for both queries because SQL Server has no statistics that tell it which ID belongs to which region. The statistics ix_Region_id_name cover both columns, but since a histogram exists only for the first column, they do not help in estimating how many matching rows there will be in the Sales table.
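The 500.5 figure is consistent with an "average rows per distinct id" guess: Sales holds 1001 rows spread over 2 distinct id values. The following is only a sketch to check that arithmetic, assuming the tables from the question's repro script exist. It is my reading of where the number comes from, not something the linked article is quoted on here.

```sql
-- Average number of Sales rows per distinct id.
-- With the repro data this is 1001 rows / 2 distinct ids.
select count(*) * 1.0 / count(distinct id) as avg_rows_per_id
from Sales;

-- The density vector of the Sales statistics holds the same information
-- ("All density" for id is 1 / number of distinct ids).
dbcc show_statistics ('Sales', 'ix_Sales_id_detail') with density_vector;
```

When the optimizer cannot tell which id a name maps to, multiplying the table's row count by that density gives the same value for 'Dallas' and 'New York'.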
If you run dbcc show_statistics ('Region','ix_Region_id_name'), the histogram in the result will be:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
0             0           1        0                    1
1             0           1        0                    1
This shows that there is 1 row for each ID, but there is no link to the names.
But once you create the statistics Region_stats_id (for Dallas), dbcc show_statistics ('Region','Region_stats_id') will show:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
0             0           1        0                    1
So SQL Server knows that there is only 1 row, and it's ID 0.
Similarly Region_stats_id2:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
1             0           1        0                    1
The histogram of ix_Sales_id_detail on Sales then gives the number of rows per ID:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
0             0           1        0                    1
1             0           1000     0                    1
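To see which of the statistics on Region are filtered and what their predicates are, something like the following can be used. It is a small sketch that assumes SQL Server 2008 or later (where sys.stats exposes has_filter and filter_definition) and the object names from the repro script.

```sql
-- List all statistics on Region, flagging the filtered ones
-- and showing the predicate each filtered statistic was built with.
select s.name,
       s.has_filter,
       s.filter_definition
from sys.stats as s
where s.object_id = object_id('Region');
```

After the filtered statistics from the question are created, Region_stats_id and Region_stats_id2 should be the rows with has_filter = 1.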
Info: This is a copy of the answer deleted by @MartijnPieters, reposted here because this is the question I intended to answer and I can't seem to do anything with the deleted answer. I accidentally posted it first on TheGameiswar's other statistics question from today, but I have already deleted that one myself.