I am parsing a CSV file with AWS Athena from Java code. Some columns in the CSV are dates, and one column has commas in its values.
If the Athena table is created with
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
then it is unable to parse the column containing the comma correctly.
However, it parses correctly if I use
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
But the issue with OpenCSVSerde is that it requires all columns to be of string data type, and I need to carry out date operations in the query, so I can't use OpenCSVSerde.
Any other solution? Please help!
That's how these two SerDes are designed. You should use LazySimpleSerDe only when your data is relatively clean, for example when values are not enclosed in quotes and do not contain the delimiter. OpenCSVSerde works well for deserializing CSV files that have values enclosed in quotes; however, all columns in the table are then of STRING data type. The Athena documentation on SerDes has more information.
So in your case, since your data is not clean, the only way to parse it and load it into Athena is to use OpenCSVSerde. If you need date operations, you have to convert the date strings into a timestamp yourself, which is fairly easy to do with the date_parse function.
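As a rough sketch of what the table definition could look like with OpenCSVSerde: the S3 location and every column name except createdate are made up for illustration, and the header-skipping property assumes your file has a header row. Adjust it to your actual layout.

```sql
-- Sketch only: id, description and the S3 path are hypothetical placeholders.
CREATE EXTERNAL TABLE somedb.sometable (
  id string,
  description string,   -- the column whose quoted values may contain commas
  createdate string     -- stays a string here; parsed at query time
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar'     = '"',
  'escapeChar'    = '\\'
)
LOCATION 's3://your-bucket/path/to/csv/'
-- Assumes the CSV has one header line; drop this if it does not.
TBLPROPERTIES ('skip.header.line.count' = '1');
```

Values containing commas are only split correctly if they are enclosed in the quote character in the file.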
Say you have the following string data in your date column:
11/13/2017
11/14/2017
11/15/2017
11/16/2017
You can use the following query to select rows whose date falls in a given range:
select * from somedb.sometable where date_parse(createdate, '%m/%d/%Y') between DATE '2017-11-14' and DATE '2017-11-16';
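If you need the parsed date in many queries, one option is to wrap the conversion in a view so it is written only once. This is just a sketch: the view name sometable_typed and the column name createdate_parsed are hypothetical, and wrapping the call in try() is my own addition so that malformed date strings become NULL instead of failing the query.

```sql
-- Hypothetical view exposing the string column as a real DATE.
CREATE OR REPLACE VIEW somedb.sometable_typed AS
SELECT
  t.*,
  -- date_parse returns a timestamp; cast it down to a date.
  -- try() yields NULL for values that do not match the format.
  CAST(try(date_parse(createdate, '%m/%d/%Y')) AS date) AS createdate_parsed
FROM somedb.sometable t;

-- The range filter then works on a typed column.
SELECT *
FROM somedb.sometable_typed
WHERE createdate_parsed BETWEEN DATE '2017-11-14' AND DATE '2017-11-16';
```

From the Java side nothing changes: you run the same kind of query against the view instead of the table.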