
Does Hive duplicate data?

Tags: hive

I have a large log file that I loaded into HDFS. HDFS replicates it to different nodes based on rack awareness.

Now I load the same file into a Hive table. The commands are as follows:

CREATE TABLE log_analysis (logtext STRING) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';

LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;

Now when I look at the '/user/hive/warehouse/' directory, there is a table file, and when I copy it to the local filesystem, it contains all the log file data.

My question is: the existing file in HDFS is already replicated. When I load that file into a Hive table, which is also stored on HDFS, it gets replicated again.

Isn't that the same file stored 6 times (assuming a replication factor of 3)? That would be such a waste of resources.

asked Nov 06 '15 04:11 by sakshi



1 Answer

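When you load data from an HDFS path (LOAD DATA INPATH without the LOCAL keyword), Hive moves the file from its original HDFS location into the table's directory rather than copying it. By default that directory is a path such as /user/hive/warehouse/yourdatabasename/tablename. Because the file is moved, only one logical copy exists afterwards, so with a replication factor of 3 its blocks are stored 3 times, not 6.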
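One way to check this yourself, as a rough sketch: the Hive CLI can run HDFS shell commands through its dfs command, so you can list the source directory and the table directory before and after the load. The paths and table name below are the ones from the question. The results described in the comments are what the move behaviour implies, not captured output, and this assumes your session is allowed to run dfs commands.

```sql
-- Before the load: the log file should be listed at its original HDFS path.
dfs -ls /user/log/;

-- Load from an HDFS path (no LOCAL keyword), as in the question.
LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;

-- After the load: apache.log should no longer appear in the source directory...
dfs -ls /user/log/;

-- ...and should now appear under the table's location instead.
dfs -ls /user/hive/warehouse/;

-- Shows the table's Location field, i.e. where Hive expects the data files.
DESCRIBE FORMATTED log_analysis;
```

By contrast, LOAD DATA LOCAL INPATH reads from the local filesystem and copies the file into the table's directory, leaving the local original in place.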

answered Sep 23 '22 22:09 by Biswanath Roy