I have struggled with an issue in Hive SQL and just found out what the issue was:
select distinct 'A', NULL;
returns 'A', NULL
select count(distinct 'A', NULL);
returns 0
select count(distinct 'A', coalesce(NULL,''));
returns 1.
I was using the select line inside of a larger query and filtered on the result (=1). With the outcome being 0, I lost a lot of rows.
How come that a row with NULL column does not contribute to the result of a count(distinct) query?
It's the interface of count
in hive:
count(*)
counts all rows
count(col1)
counts all rows where col1 is not null
count(distinct col1,col2...)
counts all distinct rows where the specified columns are not null
As a solution to your specific problem, you can try to have a nested query with the logic and use count(*)
in the outer query:
select count(*) from (select distinct 'A', NULL) a;
returns 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With