Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

I have a huge table, having a much smaller number (by orders of magnitude) of distinct values on some column x.

I need to do a query like SELECT DISTINCT x FROM hugeTable, and I want to do this relatively fast.

I did something like CREATE INDEX hugeTable_by_x ON hugeTable(x), but even though the output is small, the query is not as fast as I expected. The query plan shows that 97% of the time is spent on an Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table. This is followed by, among other things, a Hash Match operation.

Since I created an index on column x, can I not expect this query to run very quickly?

Note that I'm using Microsoft SQL Server 2005.

asked May 12 '11 05:05

polygenelubricants

3 Answers

This is likely not a problem of indexing, but one of data design. Normalization, to be precise. The fact that you need to query the distinct values of a field, and are even willing to add an index for it, is a strong indicator that the field should be normalized into a separate table with a (small) join key. The distinct values are then available immediately by scanning the much smaller lookup table.
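
As a rough sketch of that normalization (the names x_lookup, x_id and x_value are made up for illustration, and varchar(100) is only a guess at the type of x):

```sql
-- Hypothetical lookup table: one row per distinct value of x.
create table dbo.x_lookup (
    x_id    int identity(1,1) primary key,
    x_value varchar(100) not null unique   -- assumed type; use the real type of x
);

-- hugeTable stores the small key instead of the value itself.
alter table dbo.hugeTable
    add x_id int null
        constraint fk_hugeTable_x_lookup references dbo.x_lookup (x_id);

-- The distinct values now come from the small table.
select x_value from dbo.x_lookup;
```

Note that this only matches SELECT DISTINCT x exactly if values that are no longer referenced by any row of hugeTable are also removed from the lookup table.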

Update
As a workaround, you can create an indexed view on an aggregate by the 'distinct' field. COUNT_BIG is an aggregate that is allowed in indexed views:

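create view vwDistinct
with schemabinding
as
select x, count_big(*) as cnt   -- every column of an indexed view needs a name
from schema.hugetable           -- replace schema with the table's actual schema, e.g. dbo
group by x;
go                              -- create view must be alone in its batch

-- the first index on a view must be unique and clustered
create unique clustered index cdxDistinct on vwDistinct(x);

select x from vwDistinct with (noexpand);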

answered Sep 25 '22 14:09

Remus Rusanu

SQL Server does not implement any facility to seek directly to the next distinct value in an index skipping duplicates along the way.

If you have many duplicates then you may be able to use a recursive CTE to simulate this. The technique comes from the article "Super-fast DISTINCT using a recursive CTE". For example:

with recursivecte as (
  select min(t.x) as x
  from hugetable t
  union all
  select ranked.x
  from (
    select t.x,
           row_number() over (order by t.x) as rnk
    from hugetable t
    join recursivecte r
      on r.x < t.x
  ) ranked
  where ranked.rnk = 1
)
select *
from recursivecte
option (maxrecursion 0)
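
To see what the recursion is doing, here is the same idea written as a plain loop: each iteration asks for the smallest x greater than the previous one, which can be answered with a single seek on hugeTable_by_x. This is only an illustrative sketch, not a replacement for the CTE. It assumes x is an int, and @x and @result are names made up for the example.

```sql
declare @x int;                            -- assumes x is an int column
declare @result table (x int primary key); -- collects the distinct values

-- Start from the smallest value in the index.
select @x = min(x) from hugeTable;

while @x is not null
begin
    insert into @result (x) values (@x);

    -- Jump to the next distinct value; min() yields NULL when none is left,
    -- which ends the loop.
    select @x = min(x) from hugeTable where x > @x;
end;

select x from @result;
```

Like the CTE above, this does not return NULL, which SELECT DISTINCT would if x contains NULLs.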

answered Sep 22 '22 14:09

Martin Smith

If you know the values in advance and there is an index on column x (or if each value is likely to appear quickly on a sequential scan of the whole table), it can be much faster to query each one individually:

-- [values] stands for a table (or derived table) listing the candidate values
select vals.x
from [values] as vals (x)
where exists (select 1 from hugetable where hugetable.x = vals.x);

Using exists() this way does one index lookup per candidate value.
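
For a concrete version on SQL Server 2005, which does not have the multi-row VALUES constructor added in SQL Server 2008, the candidate list can be written as a derived table. The values 1, 2 and 3 below are placeholders, and the sketch assumes x is an integer column.

```sql
select vals.x
from (
    -- placeholder candidate values; substitute the values you know in advance
    select 1 union all
    select 2 union all
    select 3
) as vals (x)
where exists (select 1 from hugeTable h where h.x = vals.x);
```

Only candidates that actually occur in hugeTable are returned, so values missing from the list are never found this way.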

The way you've written it (which is correct if the values are not known in advance), the query engine will need to read the whole table and hash aggregate the mess to extract the values. (Which makes the index useless.)

answered Sep 22 '22 14:09

Denis de Bernardy

Tags: sql, tsql, indexing, sql-server-2005, query-optimization