I have a very simple csv file on S3
"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"
I'm trying to create a table over this data using the following command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
When I query the table (select * from test) I'm getting an error like this:
HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"
Some more info: if I change the d column to a string, the query will succeed. The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.
CSV SerDe (OpenCSVSerDe): The OpenCSV SerDe has the following characteristics for string data. It uses double quotes (") as the default quote character, and allows you to specify separator, quote, and escape characters, such as:
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\")
This is the SerDe for data in CSV, TSV, and custom-delimited formats that Athena uses by default. This SerDe is used if you don't specify any SerDe and only specify ROW FORMAT DELIMITED . Use this SerDe if your data does not have values enclosed in quotes.
CSV is the only output format used by the Athena SELECT query, but you can use UNLOAD to write the output of a SELECT query to the formats that UNLOAD supports. Although you can use the CTAS statement to output data in formats other than CSV, those statements also require the creation of a table in Athena.
This is indeed a problem with the documentation that you mentioned. You were probably referring to this excerpt:
[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.
Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.
Based on the definition of UNIX Epoch Time, your dates should be encoded as integers (hence the reference to the type LONG in the documentation): specifically, the number of days that have elapsed since January 1, 1970.
For instance, your sample CSV should look like this:
"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"
Then you can run that exact same command:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
If you query your Athena table with select * from test, you will get:
i d f s
--- ------------ ------- ---------------------
1 2018-01-01 1.001 something great!
2 2018-01-02 2.002 something terrible!
3 2018-01-03 3.003 I'm an oil man
An analogous problem also compromises the explanation of TIMESTAMP in the aforementioned documentation:
[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.
It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time as the number of milliseconds that have elapsed since midnight on 1 January 1970.
For instance, consider the following sample CSV:
"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"
And the following CREATE TABLE statement:
CREATE EXTERNAL TABLE test (i int, d date, f float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");
This will be the result set for select * from test:
i d f s t
--- ------------ ------- --------------------- -------------------------
1 2018-01-01 1.001 something great! 2019-07-28 04:03:58.027
2 2018-01-02 2.002 something terrible! 2019-07-30 11:37:18.027
3 2018-01-03 3.003 I'm an oil man 2019-07-18 21:50:38.012
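As before, you can cross-check the mapping between the epoch-millisecond values in the CSV and the timestamps shown above with a small Python sketch (assuming the values are interpreted as UTC, which is consistent with the result set shown):

```python
from datetime import datetime, timezone

# Convert the epoch-millisecond values from the sample CSV back to
# human-readable UTC timestamps. Integer arithmetic avoids any float
# rounding on the millisecond part.
for ms in [1564286638027, 1564486638027, 1563486638012]:
    ts = datetime.fromtimestamp(ms // 1000, tz=timezone.utc)
    ts = ts.replace(microsecond=(ms % 1000) * 1000)
    print(ms, "->", ts.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
# 1564286638027 -> 2019-07-28 04:03:58.027
# 1564486638027 -> 2019-07-30 11:37:18.027
# 1563486638012 -> 2019-07-18 21:50:38.012
```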
One way around this is to declare the d column as string, and then in the SELECT query use DATE(d) or date_parse to parse the value as the DATE data type.
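Alternatively, if you would rather keep the DATE column as declared, you can pre-process the files before uploading them to S3 so that the d column holds epoch days. A minimal sketch of that conversion, shown on an in-memory copy of the sample CSV (swap in real file handles for actual S3 data):

```python
import csv
import io
from datetime import date

# In-memory copy of the sample CSV; replace with open(...) for real files.
SRC = '''"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"
'''

epoch = date(1970, 1, 1)
out = io.StringIO()
reader = csv.reader(io.StringIO(SRC))
writer = csv.writer(out, quoting=csv.QUOTE_ALL)

header = next(reader)
writer.writerow(header)
d_idx = header.index("d")  # locate the date column by its header name
for row in reader:
    # Rewrite ISO dates as epoch days, the encoding OpenCSVSerDe expects
    # for a DATE column.
    row[d_idx] = str((date.fromisoformat(row[d_idx]) - epoch).days)
    writer.writerow(row)

print(out.getvalue())
```

The rewritten rows carry "17532", "17533", and "17534" in place of the ISO dates, so the original CREATE TABLE statement works unchanged.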