
Is it possible to import data into a Hive table without copying the data?

Tags: hadoop, hive, hdfs

I have log files stored as text in HDFS. When I load the log files into a Hive table, all the files are copied.

Can I avoid having all my text data stored twice?

EDIT: I load it via the following command

LOAD DATA INPATH '/user/logs/mylogfile' INTO TABLE `sandbox.test` PARTITION (day='20130221')

Then, I can find the exact same file in:

/user/hive/warehouse/sandbox.db/test/day=20130220

I assumed it was copied.

Mad Echet asked Mar 07 '13 12:03



3 Answers

Instead of having your Java application copy the data directly into HDFS, keep the files on the local file system and import them into HDFS through Hive with the following command:

LOAD DATA LOCAL INPATH '/your/local/filesystem/file.csv' INTO TABLE `sandbox.test` PARTITION (day='20130221')

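Note the LOCAL keyword: it tells Hive to read the file from the local file system and copy it into the table's location in HDFS.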

Abimaran Kugathasan answered Oct 16 '22 16:10


Use an external table:

CREATE EXTERNAL TABLE sandbox.test (id BIGINT, name STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/logs/';

If you want to use partitioning with an external table, you are responsible for managing the partition directories yourself. The location specified must be an HDFS directory, not a single file.

If you drop an external table, Hive WILL NOT delete the source data. If you want to manage your raw files yourself, use external tables. If you want Hive to manage them, then let Hive store the data inside its warehouse path.
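
As a rough sketch of what managing the partition directories yourself can look like, the statements below define a partitioned external table and register one existing directory as a partition. The table name `test_ext`, the single `line` column, and the directory `/user/logs/20130221/` are assumptions made up for illustration; they do not come from the question.

```sql
USE sandbox;

-- Hypothetical table: each row holds one raw log line in a single STRING column.
CREATE EXTERNAL TABLE IF NOT EXISTS test_ext (line STRING)
PARTITIONED BY (day STRING)
STORED AS TEXTFILE
LOCATION '/user/logs/';

-- Register an existing HDFS directory (hypothetical path) as a partition.
-- Only metastore metadata changes; the files are neither moved nor copied.
ALTER TABLE test_ext ADD IF NOT EXISTS PARTITION (day='20130221')
LOCATION '/user/logs/20130221/';
```

Each new day of logs would need its own `ADD PARTITION` statement, since Hive does not discover directories for an external table on its own.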

cran1um answered Oct 16 '22 17:10


You can use an ALTER TABLE ... ADD PARTITION statement to avoid data duplication.

CREATE EXTERNAL TABLE IF NOT EXISTS TestTable (testcol STRING) PARTITIONED BY (year INT, month INT, day INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

ALTER TABLE TestTable ADD PARTITION (year=2014, month=2, day=17) LOCATION 'hdfs://localhost:8020/data/2014/2/17/';
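
One way to check that the partition points at the original files rather than at a copy is to inspect its metadata. This is only a sketch, and it assumes the partition for 2014/2/17 has already been registered on `TestTable`:

```sql
-- List the partitions Hive knows about for the table.
SHOW PARTITIONS TestTable;

-- Show the partition's metadata; the Location field should be the
-- original data directory, not a path under the Hive warehouse.
DESCRIBE FORMATTED TestTable PARTITION (year=2014, month=2, day=17);
```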

Chetan Shirke answered Oct 16 '22 18:10