Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

HIVE select count(*) non null returns higher value than select count(*)

I am currently doing some data exploration with Hive and cannot explain the following behavior. Say I have a table (named mytable) with a field master_id.

When I count the number of row I get

select count(*) as c from mytable 
c
1129563

If I want to count the number of row with a non null master_id, I get a higher number

select count(*) as c from mytable where master_id is not null
c
1134041

Additionally, the master_id seems to be never null.

select count(*) as c from mytable where master_id is null
c
0

I cannot explain how adding a where statement can increase the number of rows eventually. Does anyone have any hint to explain this behavior ?

Thanks

like image 388
z_eb Avatar asked Oct 07 '16 08:10

z_eb


1 Answers

Most probably your query without where is using statistics because of this parameter is set:

set hive.compute.query.using.stats=true;

Try to set it false and execute again.

Alternatively you can compute statistics on the table. See ANALYZE TABLE SYNTAX

Also it's possible to gather statistics during INSERT OVERWRITE automatically:

set hive.stats.autogather=true;
like image 188
leftjoin Avatar answered Oct 05 '22 19:10

leftjoin