
Hive load CSV with commas in quoted fields

I am trying to load a CSV file into a Hive table like so:

CREATE TABLE mytable (
  num1 INT,
  text1 STRING,
  num2 INT,
  text2 STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

LOAD DATA LOCAL INPATH '/data.csv' OVERWRITE INTO TABLE mytable;


The CSV is delimited by a comma (,) and looks like this:

1, "some text, with comma in it", 123, "more text" 

This returns corrupt data because the first string contains a ',', which Hive treats as a field separator.
Is there a way to set a text delimiter or make Hive ignore the ',' inside strings?

I can't change the delimiter of the CSV since it gets pulled from an external source.

Asked by Martijn Lenderink, Nov 29 '12 15:11



2 Answers

If you can re-create or re-parse your input data, you can specify an escape character in the CREATE TABLE statement:

ROW FORMAT DELIMITED FIELDS TERMINATED BY "," ESCAPED BY '\\'; 

Hive will then accept this line as 4 fields:

1,some text\, with comma in it,123,more text 
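
If the file arrives quoted, as in the question, one way to get it into this escaped form is a small pre-processing step before LOAD DATA. The following is a minimal sketch in Python, not part of the original answer: the function name escape_for_hive is made up for illustration, and it assumes that each record sits on a single line and that double quotes are the only text qualifier.

```python
import csv
import io


def escape_for_hive(src, dst, escape="\\"):
    """Rewrite quoted CSV from `src` as unquoted, backslash-escaped lines in `dst`."""
    # skipinitialspace drops the blank after each comma so the quotes are recognised
    reader = csv.reader(src, skipinitialspace=True)
    for row in reader:
        fields = []
        for value in row:
            value = value.strip()
            # Escape the escape character itself first, then the field delimiter
            value = value.replace(escape, escape + escape)
            value = value.replace(",", escape + ",")
            fields.append(value)
        dst.write(",".join(fields) + "\n")


# The sample line from the question
sample = io.StringIO('1, "some text, with comma in it", 123, "more text" \n')
out = io.StringIO()
escape_for_hive(sample, out)
print(out.getvalue(), end="")
```

For the sample line this prints 1,some text\, with comma in it,123,more text, which is the escaped form shown above. With real files you would pass open file objects instead of the StringIO buffers.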
Answered by libjack, Sep 24 '22 16:09


The problem is that Hive's delimited row format doesn't handle quoted text. You either need to pre-process the data to change the delimiter between the fields (e.g. with a Hadoop-streaming job), or you can try a custom CSV SerDe that uses OpenCSV to parse the files.
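
As a rough illustration of the pre-processing route, the sketch below re-writes each record with a tab as the field delimiter, so the commas inside the text no longer matter. It is only a sketch under assumptions: the function name redelimit is hypothetical, the tab is an arbitrary choice (any character that never occurs in the data would do), and tabs or line breaks inside a value are simply replaced by spaces.

```python
import csv

TAB = "\t"


def redelimit(lines, delimiter=TAB):
    """Yield each quoted-CSV record re-joined with `delimiter`."""
    # skipinitialspace drops the blank after each comma so the quotes are recognised
    for row in csv.reader(lines, skipinitialspace=True):
        cleaned = []
        for value in row:
            # Keep one record per line and keep the new delimiter out of the values
            value = value.strip().replace(delimiter, " ").replace("\n", " ")
            cleaned.append(value)
        yield delimiter.join(cleaned)


# The sample line from the question
sample = ['1, "some text, with comma in it", 123, "more text" \n']
for record in redelimit(sample):
    print(record)
```

In a Hadoop-streaming mapper the lines would come from sys.stdin instead of a list, and the table would then be declared with FIELDS TERMINATED BY '\t'.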

Answered by Lorand Bendig, Sep 21 '22 16:09