Is it possible to import data into a Hive table without copying the data?

Tags: hadoop, hive, hdfs

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.

Can I avoid having all my text data stored twice?

EDIT: I load it with the following command:

LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Then, I can find the exact same file in:

/user/hive/warehouse/sandbox.db/test/day=20130220

I assumed it was copied.

asked Mar 07 '13 12:03

Mad Echet

3 Answers

Instead of having your Java application copy the data directly to HDFS, keep those files on the local file system and import them into HDFS via Hive with the following command:

LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Notice the LOCAL keyword: it tells Hive to read the file from the local file system rather than from HDFS.

answered Oct 16 '22 16:10

Abimaran Kugathasan

Use an external table:

CREATE EXTERNAL TABLE sandbox.test(id BIGINT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';

If you want to use partitioning with an external table, you are responsible for managing the partition directories yourself. The location specified must be an HDFS directory.

If you drop an external table, Hive will NOT delete the source data. If you want to manage your raw files yourself, use external tables. If you want Hive to manage them, then let Hive store the data inside its warehouse path.

answered Oct 16 '22 17:10

cran1um

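You can use an ALTER TABLE ... ADD PARTITION statement to avoid data duplication: create a partitioned external table, then point each partition at the directory that already holds the data.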

CREATE EXTERNAL TABLE IF NOT EXISTS TestTable (testcol STRING) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE TestTable ADD PARTITION (year='2014', month='2', day='17') LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
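
Registering one partition per day by hand gets tedious, so here is a minimal sketch that generates the statements for a date range. It assumes the data sits in directories laid out as <base>/<year>/<month>/<day>/ without zero padding, as in the example path above, and it uses the ADD PARTITION form of ALTER TABLE. The function name add_partition_statements and its parameters are made up for illustration; they are not part of Hive.

```python
from datetime import date, timedelta


def add_partition_statements(table, base_uri, start, end):
    """Return one ALTER TABLE ... ADD PARTITION statement per day in [start, end]."""
    statements = []
    day = start
    while day <= end:
        # Assumed layout: <base_uri>/<year>/<month>/<day>/ with no zero padding.
        location = f"{base_uri}/{day.year}/{day.month}/{day.day}/"
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION (year={day.year}, month={day.month}, day={day.day}) "
            f"LOCATION '{location}';"
        )
        day += timedelta(days=1)
    return statements


if __name__ == "__main__":
    # Table name and URI are taken from the example above; adjust to your cluster.
    for stmt in add_partition_statements(
        "TestTable", "hdfs://localhost:8020/data", date(2014, 2, 17), date(2014, 2, 19)
    ):
        print(stmt)
```

The printed statements can be saved to a file and run with hive -f. Only metadata is registered, so the files stay where they are in HDFS.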

answered Oct 16 '22 18:10

Chetan Shirke
