Hive INSERT OVERWRITE DIRECTORY command output is not separated by a delimiter. Why?

Question

The file I am loading is delimited by ' ' (a single space) and resides in HDFS. Below is the file:

1> I am creating an external table and loading the file by issuing the below command:-

CREATE EXTERNAL TABLE IF NOT EXISTS graph_edges (
  src_node_id STRING COMMENT 'Node ID of Source node',
  dest_node_id STRING COMMENT 'Node ID of Destination node')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/user/hadoop/input';

2> After this, I am simply writing the table's contents to another directory by issuing the below command:-

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT * FROM graph_edges;

3> Now, when I cat the file, the fields are not separated by any delimiter:-

hadoop dfs -cat /user/hadoop/output/000000_0

Output:-

Can somebody please help me out? Why is the delimiter being removed, and how can I delimit the output file?

In the CREATE TABLE command I also tried DELIMITED BY ' ', but then I get an unnecessary NULL column.

Any pointers would be much appreciated. I am using Hive 0.9.0.

kgu87 · Accepted Answer

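The problem is that Hive does not allow you to specify the output delimiter for INSERT OVERWRITE DIRECTORY - https://issues.apache.org/jira/browse/HIVE-634. The output is written with Hive's default field separator, Ctrl-A (\001), which is a non-printing character, so the fields look as if they were run together when you cat the file.

The solution is to create an external table for the output (with a delimiter specification) and insert into that table instead of a directory.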

--

Assuming that you have /user/hadoop/input/graph_edges.csv in HDFS,

hive> create external table graph_edges (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/input';

hive> select * from graph_edges;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

hive> create external table graph_out (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/output';

hive> insert into table graph_out select * from graph_edges;
hive> select * from graph_out;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

[user@box] hadoop fs -get /user/hadoop/output/000000_0 .

Comes back as above, with spaces.
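
For files that INSERT OVERWRITE DIRECTORY has already written, a rough way to confirm that the Ctrl-A separator is really there, and to turn it into a space, is sketched below. The sample file names and their contents are made up for illustration so the snippet runs without a cluster; only the path in the closing comment comes from the question.

```shell
# Stand-in for Hive's default directory output: two fields per line,
# separated by Ctrl-A (octal 001). File name and rows are hypothetical.
printf '001\001000\n002\001001\n' > sample_ctrl_a_output

# cat -v displays the otherwise invisible separator as ^A.
cat -v sample_ctrl_a_output

# Translate every Ctrl-A into a space to get a space-delimited copy.
tr '\001' ' ' < sample_ctrl_a_output > sample_space_output
cat sample_space_output

# On a cluster the same filter would be applied to the HDFS file, e.g.:
#   hadoop fs -cat /user/hadoop/output/000000_0 | tr '\001' ' '
```

This only post-processes the text; if any field value can itself contain a space, writing through a delimited table as shown above is the safer route.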

Garren S · Answer

While the question is over 2 years old and the top answer was correct at the time, it is now possible to tell Hive to write delimited data to a directory.

Here is an example of outputting the data with the traditional ^A separator:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
SELECT *
FROM data_schema.data_table;

And now with tab delimiters:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
SELECT *
FROM data_schema.data_table;
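
To check that the tab delimiter actually ended up in the output, one option is to count tab-separated fields per line. This is only a sketch: the sample file below is a made-up stand-in for a Hive output file, and the 000000_0 file name in the closing comment is an assumption borrowed from the question rather than something this answer states.

```shell
# Hypothetical stand-in for a tab-delimited Hive output file.
printf '001\t000\n002\t001\n' > sample_tab_output

# Print the distinct field counts when splitting on tabs.
# A single value equal to the number of selected columns (2 here)
# suggests the delimiter was applied on every line.
awk -F'\t' '{ print NF }' sample_tab_output | sort -u

# Against HDFS the same check would look like (file name assumed):
#   hadoop fs -cat /output/data_delimited/000000_0 | awk -F'\t' '{ print NF }' | sort -u
```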
