I am currently doing some data exploration with Hive and cannot explain the following behavior. Say I have a table (named mytable) with a field master_id.
When I count the number of row I get
select count(*) as c from mytable
c
1129563
If I want to count the number of row with a non null master_id, I get a higher number
select count(*) as c from mytable where master_id is not null
c
1134041
Additionally, the master_id seems to be never null.
select count(*) as c from mytable where master_id is null
c
0
I cannot explain how adding a where statement can increase the number of rows eventually. Does anyone have any hint to explain this behavior ?
Thanks
Most probably your query without where is using statistics because of this parameter is set:
set hive.compute.query.using.stats=true;
Try to set it false and execute again.
Alternatively you can compute statistics on the table. See ANALYZE TABLE SYNTAX
Also it's possible to gather statistics during INSERT OVERWRITE automatically:
set hive.stats.autogather=true;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With