
How to load data to hive from HDFS without removing the source file?

Tags: hadoop, hive

When I load data from HDFS into Hive using the command

LOAD DATA INPATH 'hdfs_file' INTO TABLE tablename;

it looks like hdfs_file is moved into the hive/warehouse directory. Is it possible to copy the file instead of moving it, so that it can still be used by another process? If so, how?

asked Sep 27 '11 10:09 by Suge




2 Answers

From your question I assume that your data is already in HDFS. In that case you don't need LOAD DATA, which moves the files to the default Hive location, /user/hive/warehouse. You can simply define the table with the external keyword, which leaves the files in place and only creates the table definition in the Hive metastore. See the Create Table DDL documentation. For example:

create external table table_name (
  id int,
  myfields string
)
location '/my/location/in/hdfs';

Please note that the format of your files might differ from Hive's default (as mentioned by JigneshRawal in the comments). You can specify your own delimiter in the table definition, for example for comma-separated files written by Sqoop:

row format delimited fields terminated by ',' 
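
Putting the two pieces above together, a minimal sketch of the full statement might look like the following. It reuses the table name, columns and location from the example; the comma delimiter and the text file storage format are assumptions about the data, not something the question states. Note that the location must be an HDFS directory rather than a single file, and Hive reads every file in it.

```sql
-- Sketch only: assumes comma-delimited text files under /my/location/in/hdfs.
-- EXTERNAL means Hive records the table in the metastore but leaves the
-- files where they are; dropping the table later does not delete them.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  id       INT,
  myfields STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/my/location/in/hdfs';

-- Quick check that Hive can read the files in place.
SELECT * FROM table_name LIMIT 10;
```

Because no LOAD DATA is involved, the other process can keep reading the same files from the same path.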
answered Oct 22 '22 02:10 by Dag


I found that when you use EXTERNAL TABLE and LOCATION together, Hive creates the table and initially it contains no data (assuming your data sits in a different place from the table's LOCATION).

When you then use the LOAD DATA INPATH command, the data gets MOVED (not copied) from its source location to the location you specified when creating the Hive table.

If no location is given when you create the Hive table, Hive uses its internal warehouse location, and the data is moved from your source location to the Hive data warehouse (i.e. /user/hive/warehouse/).
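
Since LOAD DATA INPATH always moves its input, one workaround when the source file must stay put is to copy it first and load the copy. The sketch below assumes it is run in the Hive CLI, which accepts dfs commands; the paths /data/source/hdfs_file and /tmp/hive_staging are hypothetical and stand in for your own locations.

```sql
-- Hypothetical paths; adjust to your cluster.
-- Copy the source file to a staging directory so the original is untouched.
dfs -mkdir -p /tmp/hive_staging;
dfs -cp /data/source/hdfs_file /tmp/hive_staging/hdfs_file;

-- LOAD DATA moves the staged copy (not the original) into the table's directory.
LOAD DATA INPATH '/tmp/hive_staging/hdfs_file' INTO TABLE tablename;
```

This keeps the original file available to the other process, at the cost of storing the data twice in HDFS.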

answered Oct 22 '22 04:10 by Avinash