Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does count( distinct ) with NULL columns return 0 in Hive SQL?

I have struggled with an issue in Hive SQL and just found out what the issue was:

select distinct 'A', NULL;        

returns 'A', NULL

select count(distinct 'A', NULL);  

returns 0

select count(distinct 'A', coalesce(NULL,''));  

returns 1.

I was using the select line inside of a larger query and filtered on the result (=1). With the outcome being 0, I lost a lot of rows.

How come that a row with NULL column does not contribute to the result of a count(distinct) query?

like image 512
AVS Avatar asked Sep 20 '25 18:09

AVS


1 Answers

It's the interface of count in hive:

count(*) counts all rows

count(col1) counts all rows where col1 is not null

count(distinct col1,col2...) counts all distinct rows where the specified columns are not null

As a solution to your specific problem, you can try to have a nested query with the logic and use count(*) in the outer query:

select count(*) from (select distinct 'A', NULL) a; returns 1

like image 64
Alex Libov Avatar answered Sep 22 '25 08:09

Alex Libov