Athena unable to parse date using OpenCSVSerde

I have a very simple csv file on S3

"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"

I'm trying to create a table across this using the following command

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

When I query the table (select * from test) I'm getting an error like this:

HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"

Some more info:

  • If I change the d column to a string the query will succeed
  • I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
  • Definitely seems like a problem with the OpenCSVSerde

The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.

asked Sep 29 '18 by Kirk Broadhurst


2 Answers

In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:

[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.

Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.

Based on the definition of the UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation): specifically, the number of days that have elapsed since January 1, 1970.
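
If you need to produce those numbers, one way is to let Athena compute them with the Presto date_diff function (a quick sketch, counting days from the epoch):

SELECT date_diff('day', DATE '1970-01-01', DATE '2018-01-01');  -- returns 17532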

For instance, your sample CSV should look like this:

"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"

Then you can run that exact same command:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

If you query your Athena table with select * from test, you will get:

  i       d          f              s           
 --- ------------ ------- --------------------- 
  1   2018-01-01   1.001   something great!     
  2   2018-01-02   2.002   something terrible!  
  3   2018-01-03   3.003   I'm an oil man    

An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:

[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.

It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time as the number of milliseconds that have elapsed since midnight, 1 January 1970 (UTC).

For instance, consider the following sample CSV:

"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"

And the following CREATE TABLE statement:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

This will be the result set for select * from test:

  i       d          f              s                       t             
 --- ------------ ------- --------------------- ------------------------- 
  1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027  
  2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027  
  3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012  
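
As a quick sanity check on those millisecond values, Presto's from_unixtime and to_unixtime functions (both available in Athena, both working in seconds, so multiply or divide by 1000) convert in either direction:

SELECT from_unixtime(1564286638027 / 1000.0);
-- 2019-07-28 04:03:58.027

SELECT CAST(round(to_unixtime(TIMESTAMP '2019-07-28 04:03:58.027') * 1000) AS bigint);
-- 1564286638027
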
answered Sep 22 '22 by Alexandre


One way around this is to declare the d column as a string and then, in the select query, use DATE(d) or date_parse to convert the value to the date data type.
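
A minimal sketch of that workaround (the table name test_str is just for illustration; DATE() and date_parse are Presto functions that Athena supports):

CREATE EXTERNAL TABLE test_str (i int, d string, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

SELECT i,
       DATE(d) AS d_cast,                            -- cast the ISO string directly to DATE
       DATE(date_parse(d, '%Y-%m-%d')) AS d_parsed,  -- or parse explicitly and take the date part
       f,
       s
FROM test_str;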

answered Sep 21 '22 by Tanveer Uddin