Does Hive duplicate data?

Tags:

hive

I have a large log file which I loaded into HDFS. HDFS replicates it to different nodes based on rack awareness.

Now I load the same file into a Hive table. The commands are as below:

create table log_analysis (logtext string) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';

LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;

Now when I look in the '/user/hive/warehouse/' directory, there is a table file. After copying it to the local filesystem, I can see that it contains all the log file data.

My question is: the existing file in HDFS is replicated. When that file is loaded into a Hive table, which is also stored on HDFS, it gets replicated as well.

Isn't the same file then stored 6 times (assuming a replication factor of 3)? That would be such a waste of resources.

743

asked Nov 06 '15 04:11

sakshi

1 Answer

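The data is not duplicated in this case. When you load data from HDFS with LOAD DATA INPATH (without LOCAL), Hive moves the file from its original HDFS path into the table's directory rather than copying it. That directory is the table's LOCATION if one was given, otherwise by default /user/hive/warehouse/yourdatabasename.db/tablename. After the load, the file no longer exists at /user/log/apache.log, so with a replication factor of 3 there are still only 3 copies of the data, not 6.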
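
One way to confirm the move is to list the source and target directories from the Hive CLI before and after the load. The sketch below reuses the paths and table name from the question. The external table at the end is an alternative for leaving the file where it is; its name log_analysis_ext is hypothetical and not part of the original setup.

```sql
-- Assumes the Hive CLI, whose "dfs" command runs HDFS shell commands.
-- Before the load: apache.log should be listed under /user/log/.
dfs -ls /user/log/;

LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;

-- After the load: apache.log should no longer appear here...
dfs -ls /user/log/;
-- ...and should appear under the table's location instead.
dfs -ls /user/hive/warehouse/;

-- Alternative (hypothetical table name): an external table that reads
-- the files in place, so nothing is moved or copied. LOCATION must be
-- a directory; every file under /user/log/ becomes part of the table.
CREATE EXTERNAL TABLE log_analysis_ext (logtext STRING)
STORED AS TEXTFILE
LOCATION '/user/log/';
```

Dropping an external table removes only its metadata and leaves the files under /user/log/ untouched, which is usually what you want when other jobs also read the raw log.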

195

answered Sep 23 '22 22:09

Biswanath Roy
