
CSV parse using aws athena

I am parsing a CSV file with AWS Athena from Java code. Some columns in the CSV are dates, and one column contains commas in its values.

If the athena table is created with

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

then it cannot parse the column that contains commas correctly.

However it parses correctly if I use

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'

But the issue with OpenCSVSerde is that it requires all columns to be of the string data type, and I need to carry out date operations in the query, so I can't use OpenCSVSerde.

Any other solution? Please help!

cooldev asked Jun 19 '17 01:06



1 Answer

That's how these two SerDes are designed. You should only use LazySimpleSerDe when your data is relatively clean, that is, when it has no values enclosed in quotes and no delimiters inside values. OpenCSVSerde works well for deserializing CSV files that have values enclosed in quotes; the trade-off is that all columns in the table are of the STRING data type. More info here

So in your case, since your data is not clean, the only way to parse it and load it into Athena is to use OpenCSVSerde. If you need date operations, you have to convert the date strings yourself at query time, which is fairly easy to do with the date_parse function. Note that date_parse returns a timestamp, which you can compare with dates directly or cast to a date.
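As a rough sketch of what such a table definition could look like: the column names `id` and `description`, the S3 location, and the header-skipping table property are assumptions made up for illustration, while `createdate` and `somedb.sometable` match the query further down. Every column is declared as string, as OpenCSVSerde expects here.

```sql
-- Hypothetical table: id, description and the S3 path are placeholders, adjust to your data.
CREATE EXTERNAL TABLE somedb.sometable (
  id string,
  description string,   -- may contain commas if the value is quoted in the file
  createdate string     -- kept as text, e.g. 11/13/2017; parsed at query time
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"',
  'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://your-bucket/path/to/csv/'
TBLPROPERTIES ('skip.header.line.count' = '1');  -- assumes the file has a header row
```

The SERDEPROPERTIES shown are the usual comma, double quote and backslash settings; if your file uses a different quote or escape character, change them accordingly.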

Say you have the following string data in your date column:

11/13/2017
11/14/2017
11/15/2017
11/16/2017

You can use the following query to select the rows whose date falls in a given range:

select *
from somedb.sometable
where date_parse(createdate, '%m/%d/%Y') between DATE '2017-11-14' and DATE '2017-11-16';
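If several queries need the parsed date, one option is to do the conversion once in a view. This is only a sketch under assumptions: the view name `sometable_typed` and the column name `createdate_parsed` are hypothetical, and wrapping the call in `try` is a choice made here so that malformed date strings become NULL instead of failing the whole query.

```sql
-- Hypothetical view: sometable_typed and createdate_parsed are illustrative names.
CREATE OR REPLACE VIEW somedb.sometable_typed AS
SELECT
  t.*,
  -- date_parse returns a timestamp; try() turns parse errors into NULL,
  -- and the cast drops the time part so the column is a plain date.
  CAST(try(date_parse(createdate, '%m/%d/%Y')) AS date) AS createdate_parsed
FROM somedb.sometable t;
```

The range filter from above then becomes:

```sql
SELECT *
FROM somedb.sometable_typed
WHERE createdate_parsed BETWEEN DATE '2017-11-14' AND DATE '2017-11-16';
```

Rows whose `createdate` does not match the format end up with a NULL `createdate_parsed` and are simply excluded by the filter.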
Babl answered Nov 13 '22 12:11
