Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest Way to Count Distinct Values in a Column, Including NULL Values

The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.

This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.

This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).

What is the most efficient way to do this?


Update After Testing

I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest

  1. Count/GroupBy (Cheran)
  2. CountDistinct+SubQuery (Ellis)
  3. dense_rank (Eriksson)
  4. Count+Max (Andriy)

Testing Results (in seconds):

   Method          20_Columns   5_Columns   1_Column (Large)   1_Column (Small)
1) Count/GroupBy      10.8          4.8            2.8               0.14       
2) CountDistinct      12.4          4.8            3                 0.7         
3) dense_rank        226           30              6                 4.33 
4) Count+Max          98.5         44             16                12.5        

Notes:

  • Interestingly enough, the two methods that were fastest (by far, with only a small difference in between then) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
  • Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.

Thanks for the help and suggestions!

like image 843
Yaakov Ellis Avatar asked Sep 11 '11 05:09

Yaakov Ellis


2 Answers

A development on OP's own solution:

SELECT
  COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable
like image 176
Andriy M Avatar answered Sep 28 '22 02:09

Andriy M


SELECT COUNT(*)
FROM (SELECT ColumnName
      FROM TableName
      GROUP BY ColumnName) AS s;

GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.

like image 31
Cheran Shunmugavel Avatar answered Sep 28 '22 00:09

Cheran Shunmugavel