
How to handle quote-enclosed CSV fields when importing data from S3 into DynamoDB using EMR/Hive

I am trying to use EMR/Hive to import data from S3 into DynamoDB. My CSV file has fields that are enclosed in double quotes and separated by commas. When creating the external table in Hive, I can specify the delimiter as a comma, but how do I specify that the fields are enclosed in quotes?

If I don't specify this, the values in DynamoDB end up wrapped in two sets of double quotes (""value""), which seems wrong.

I am using the following command to create the external table. Is there a way to specify that the fields are enclosed in double quotes?

CREATE EXTERNAL TABLE emrS3_import_1(col1 string, col2 string, col3 string, col4 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '","'
LOCATION 's3://emrTest/folder';

Any suggestions would be appreciated. Thanks, Jitendra

RandomQuestion asked Dec 27 '12 21:12



2 Answers

I was stuck on the same issue: my fields are enclosed in double quotes and separated by semicolons (;). My table name is employee1.

After some searching, I found a solution that works.

We have to use a SerDe for this. Download the SerDe jar from this link: https://github.com/downloads/IllyaYalovyy/csv-serde/csv-serde-0.9.1.jar

Then follow the steps below at the Hive prompt:

add jar path/to/csv-serde.jar;

create table employee1(id string, name string, addr string)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties(
"separatorChar" = "\;",
"quoteChar" = "\"")
stored as textfile
;

Then load the data from your path using the query below:

load data local inpath 'path/xyz.csv' into table employee1;

Then run:

select * from employee1;

Now you will see the magic: the fields come back without the enclosing quotes. Thanks.

Cast_A_Way answered Sep 16 '22 15:09


The following code solved the same type of problem:

CREATE TABLE TableRowCSV2(
    CODE STRING,
    PRODUCTCODE STRING,
    PRICE STRING
)
COMMENT 'row data csv'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = "\,",
    "quoteChar"     = "\""
)
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="1");
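
Applied to the S3 data from the question, the same SerDe could be combined with EMR's DynamoDB storage handler roughly as shown below. This is only a sketch under assumptions: it presumes a Hive version that bundles OpenCSVSerde (0.14 or later; on older versions the csv-serde jar from the other answer would take its place), and the Hive table name ddb_target, the DynamoDB table name MyDynamoTable, and the attribute names in the column mapping are hypothetical placeholders, not values from the question.

```sql
-- External table over the quoted CSV files in S3.
-- Table name, columns, and location are taken from the question.
CREATE EXTERNAL TABLE emrS3_import_1 (
    col1 STRING,
    col2 STRING,
    col3 STRING,
    col4 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    "separatorChar" = ",",   -- fields are separated by commas
    "quoteChar"     = "\""   -- fields are enclosed in double quotes
)
STORED AS TEXTFILE
LOCATION 's3://emrTest/folder';

-- Hive table backed by DynamoDB.
-- ddb_target, MyDynamoTable, and attr1..attr4 are hypothetical names.
CREATE EXTERNAL TABLE ddb_target (
    col1 STRING,
    col2 STRING,
    col3 STRING,
    col4 STRING
)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES (
    "dynamodb.table.name"     = "MyDynamoTable",
    "dynamodb.column.mapping" = "col1:attr1,col2:attr2,col3:attr3,col4:attr4"
);

-- Copy the parsed rows (quotes already stripped by the SerDe) into DynamoDB.
INSERT OVERWRITE TABLE ddb_target
SELECT col1, col2, col3, col4
FROM emrS3_import_1;
```

OpenCSVSerde reads every column as a string, which happens to match the all-string schema in the question; other column types would need a cast in the SELECT.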

Shankar answered Sep 18 '22 15:09