Why does count( distinct ) with NULL columns return 0 in Hive SQL?

Question

I have struggled with an issue in Hive SQL and just found out what the issue was:

select distinct 'A', NULL;

returns 'A', NULL

select count(distinct 'A', NULL);

returns 0

select count(distinct 'A', coalesce(NULL,''));

returns 1.

I was using the select line inside of a larger query and filtered on the result (=1). With the outcome being 0, I lost a lot of rows.

How come that a row with NULL column does not contribute to the result of a count(distinct) query?

Alex Libov · Accepted Answer

It's the interface of count in hive:

count(*) counts all rows

count(col1) counts all rows where col1 is not null

count(distinct col1,col2...) counts all distinct rows where the specified columns are not null

As a solution to your specific problem, you can try to have a nested query with the logic and use count(*) in the outer query:

select count(*) from (select distinct 'A', NULL) a; returns 1

Donate For Us