I have a bucketed hive table. It has 4 buckets.
CREATE TABLE user(user_id BIGINT, firstname STRING, lastname STRING)
COMMENT 'A bucketed copy of user_info'
CLUSTERED BY(user_id) INTO 4 BUCKETS;
Initially i have inserted some records into this table using the following query.
set hive.enforce.bucketing = true;
insert into user
select * from second_user;
After this operation In HDFS I see that 4 files are created under this table dir.
Again i needed to insert another set of data into user table. So i ran the below query.
set hive.enforce.bucketing = true;
insert into user
select * from third_user;
Now another 4 files are crated under user folder dir. Now it has total 8 files.
Is this fine to do this kind of multiple inserts into a bucketed table? Does it affect the bucketing of the table?
I figured it out!! Actually if you do multiple inserts on a bucketed hive table. Hive wont complain as such. All hive queries will work fine.
Having said that, Such operation spoils the bucketing concept of the table. I mean after multiple inserts into a bucketed table the sampling fails.
The TABLASAMPLE doesnt work properly after multiple inserts.
Even sort merge bucket map join also doesnt work after such operation.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With