
Hive INSERT OVERWRITE DIRECTORY command output is not separated by a delimiter. Why?

Tags: hadoop, hive

The file I am loading is delimited by ' ' (a single space) and resides in HDFS. Its contents:

001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

1> I create an external table over the file with the following command:

CREATE EXTERNAL TABLE IF NOT EXISTS graph_edges (src_node_id STRING COMMENT 'Node ID of Source node', dest_node_id STRING COMMENT 'Node ID of Destination node') ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/user/hadoop/input';

2> After this, I write the table's contents to another directory with the following command:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT * FROM graph_edges;

3> Now, when I cat the output file, the fields are not separated by any delimiter:

hadoop dfs -cat /user/hadoop/output/000000_0

Output:-

001000
001000
002001
003002
004003
005004
006005
007006
008007
099007

Can somebody please help me out? Why is the delimiter being removed, and how can I delimit the output file?

I also tried '\t' as the field delimiter in the CREATE TABLE command, but then I get an unnecessary NULL column.

Any pointers would be much appreciated. I am using Hive 0.9.0.

Anuroop asked May 09 '13 10:05


2 Answers

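The problem is that Hive 0.9.0 does not allow you to specify the output delimiter for INSERT OVERWRITE DIRECTORY (see https://issues.apache.org/jira/browse/HIVE-634). The output is written with Hive's default field delimiter, ^A (Ctrl-A, \001), which is a non-printing character, so the fields look concatenated when you cat the file.

The solution is to create an external table for the output (with the delimiter specified) and insert into that table instead of into a directory.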

--

Assuming that you have /user/hadoop/input/graph_edges.csv in HDFS,

hive> create external table graph_edges (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/input';

hive> select * from graph_edges;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

hive> create external table graph_out (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/output';

hive> insert into table graph_out select * from graph_edges;
hive> select * from graph_out;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

[user@box] hadoop fs -get /user/hadoop/output/000000_0 .

The fetched file has the same contents as above, with the fields separated by spaces.
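
If the output has already been written by INSERT OVERWRITE DIRECTORY, the delimiter is still there, it is just the non-printing Ctrl-A (\001) character. A minimal sketch of how to reveal and replace it from the shell follows; the two sample rows are simulated with printf so the snippet runs without a cluster, and the variable names are only illustrative.

```shell
# Simulate two rows as Hive writes them by default:
# fields joined by Ctrl-A (octal 001), which a plain cat does not show.
sample=$(printf '001\001000\n002\001001\n')

# cat -v renders the control character visibly as ^A.
printf '%s\n' "$sample" | cat -v

# tr replaces every Ctrl-A with a space, giving space-delimited rows.
converted=$(printf '%s\n' "$sample" | tr '\001' ' ')
printf '%s\n' "$converted"
```

Against the real file, the same filter could be applied as `hadoop fs -cat /user/hadoop/output/000000_0 | tr '\001' ' '`.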

kgu87 answered Nov 01 '22 12:11


While the question is over 2 years old and the top answer was correct at the time, it is now possible to tell Hive to write delimited data to a directory (the ROW FORMAT clause for INSERT OVERWRITE DIRECTORY is available as of Hive 0.11.0).

Here is an example of outputting the data with Hive's default ^A (Ctrl-A, \001) separator:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
SELECT *
FROM data_schema.data_table;

And now with tab delimiters:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT *
FROM data_schema.data_table;
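
One practical upside of tab-delimited output is that standard shell tools split it without extra options. The sketch below simulates two tab-delimited rows with printf rather than reading from HDFS; the assumption that the second field is a "dest" column is borrowed from the question's table and is not something data_schema.data_table is known to have.

```shell
# Simulate two rows of tab-delimited output (hypothetical data).
rows=$(printf '001\t000\n002\t001\n')

# cut uses tab as its default field delimiter, so -f2 selects the second column.
dest=$(printf '%s\n' "$rows" | cut -f2)
printf '%s\n' "$dest"
```
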

Garren S answered Nov 01 '22 11:11