I'm creating external tables in hive and then user insert overwrite directory ... to add the files. However the second time that I run my query I expect the old files to get deleted and new files replace them (because I have the overwrite option). However that is not the case and new files gets added to the directory without removing the old files which causes inconsistency in the data. What is going wrong here?
I was going to submit a bug but this is existing issue : HIVE-13997 -- apply the patch if you wish to use overwrite directory for expected results.
From Test what I have found is:
overwrite directory and overwrite table works differently: you should use overwrite table if you wish to overwrite the entire directory.
created this table for test: create external table t2 (a1 int,a2 string) LOCATION '/user/cloudera/t2';
overwrite directory:
The directory is, as you would expect, OVERWRITten; in other words, if the specified path exists, it is clobbered and replaced with the output.
hive> insert overwrite directory '/user/cloudera/t2' select * from sqoop_import.departments;
so if above statement is going to write data at location /user/cloudera/t2/000000_0 then only this location is overwritten.
#~~~~ BEFORE ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 17:42 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
#~~~~ AFTER: Note the timestamp ~~~~
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:01 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
-rwxr-xr-x 1 cloudera cloudera 88 2016-07-25 15:48 /user/cloudera/t2/part-m-00000
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 15:48 /user/cloudera/t2/part-m-00001
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
overwrite table:
The contents of the chosen table or partition are replaced with the output of corresponding select statement.
hive> insert overwrite table t2 select * from sqoop_import.departments;
Now entire dir is overwritten:
[cloudera@quickstart ~]$ hadoop fs -ls /user/cloudera/t2/*
-rwxr-xr-x 1 cloudera cloudera 60 2016-07-25 18:03 /user/cloudera/t2/000000_0
-rwxr-xr-x 1 cloudera cloudera 0 2016-07-25 15:48 /user/cloudera/t2/_SUCCESS
so in conclusion, overwrite directory only overwrite direct path of generated file not the directory. see Writing-data-into-the-file-system-from-queries
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With