I have a large log file which I loaded into HDFS. HDFS will replicate it to different nodes based on rack awareness.
Now I load the same file into a Hive table. The commands are as below:
create table log_analysis (logtext string) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';
LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;
Now when I look at the '/user/hive/warehouse/' directory, there is a table file, and when I copy it to the local filesystem it contains all of the log file data.
My question is: the existing file in HDFS is replicated. When that file is loaded into a Hive table, which is also stored on HDFS, it gets replicated as well.
Is that not the same file stored 6 times (assuming a replication factor of 3)? That would be such a waste of resources.
A Hive CREATE TABLE statement does not copy any data. The data remains in the location specified in the table DDL.
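If you would rather leave the file where it already is, one option is an external table whose LOCATION is the directory that holds the data. This is only a sketch: the table name log_analysis_ext is made up for illustration, and it assumes /user/log/ contains nothing but the log files you want in the table.

```sql
-- Hypothetical table name. LOCATION must be a directory, not a single file,
-- so every file under /user/log/ becomes part of the table.
CREATE EXTERNAL TABLE log_analysis_ext (logtext STRING)
STORED AS TEXTFILE
LOCATION '/user/log/';

-- Reads the files in place; no LOAD DATA step is needed.
SELECT COUNT(*) FROM log_analysis_ext;
```

Dropping an external table normally removes only the metadata, so the files under /user/log/ stay in HDFS.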
To remove duplicate rows, you can use INSERT OVERWRITE TABLE in Hive with the DISTINCT keyword when selecting from the original table. DISTINCT returns only the unique records from the table.
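As a rough sketch of that approach, assuming a table named mytable in which the duplicates are rows that are identical in every column:

```sql
-- Rewrite the table so that each fully identical row is kept only once.
-- mytable is an assumed name; substitute your own table.
INSERT OVERWRITE TABLE mytable
SELECT DISTINCT * FROM mytable;
```

This collapses only rows that match in every column. Rows that share a key but differ in some other column are all kept.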
select primary_key1, primary_key2, count(*) from mytable group by primary_key1, primary_key2 having count(*) > 1;
The above query should list the key values that are duplicated and how many times each one occurs.
Sometimes we have a requirement to remove duplicate events from a Hive table partition. There can be multiple ways to do it. Usually, it depends on the conditions under which we want to do it.
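Correct. When you load data from HDFS, the data is moved (not copied) from its original HDFS path to /user/hive/warehouse/yourdatabasename/tablename. Because it is a move, the file exists only once, so with a replication factor of 3 it is stored 3 times, not 6.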
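One way to check this yourself is from the Hive CLI, which accepts dfs commands. The following is a sketch using the table name and paths from the question. It assumes the LOAD DATA statement has already run and that your session is allowed to issue dfs commands.

```sql
-- The Location field in the output shows where the table's data lives.
DESCRIBE FORMATTED log_analysis;

-- After LOAD DATA INPATH, apache.log should no longer be listed here.
dfs -ls /user/log/;

-- The file should appear under the table location instead.
dfs -ls /user/hive/warehouse/;
```

By contrast, LOAD DATA LOCAL INPATH reads from the local filesystem and copies the file into the table location, leaving the local source in place.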