Athena unable to parse date using OpenCSVSerde

I have a very simple csv file on S3

"i","d","f","s"
"1","2018-01-01","1.001","something great!"
"2","2018-01-02","2.002","something terrible!"
"3","2018-01-03","3.003","I'm an oil man"

I'm trying to create a table across this using the following command

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

When I query the table (select * from test) I'm getting an error like this:

HIVE_BAD_DATA:
Error parsing field value '2018-01-01' for field 1: For input string: "2018-01-01"

Some more info:

  • If I change the d column to a string the query will succeed
  • I've previously parsed dates in text files using Athena; I believe using LazySimpleSerDe
  • Definitely seems like a problem with the OpenCSVSerde

The documentation definitely implies that this is supported. Looking for anyone who has encountered this, or any suggestions.

asked Sep 29 '18 by Kirk Broadhurst


2 Answers

In fact, it is a problem with the documentation that you mentioned. You were probably referring to this excerpt:

[OpenCSVSerDe] recognizes the DATE type if it is specified in the UNIX format, such as YYYY-MM-DD, as the type LONG.

Understandably, you were formatting your date as YYYY-MM-DD. However, the documentation is deeply misleading in that sentence. When it refers to UNIX format, it actually has UNIX Epoch Time in mind.

Based on the definition of the UNIX Epoch, your dates should be integers (hence the reference to the type LONG in the documentation): specifically, the number of days that have elapsed since January 1, 1970.
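
If you need to produce those numbers, one way is to let Athena compute them with the Presto date_diff function (a quick sketch, counting days from the epoch):

SELECT date_diff('day', DATE '1970-01-01', DATE '2018-01-01');  -- returns 17532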

For instance, your sample CSV should look like this:

"i","d","f","s"
"1","17532","1.001","something great!"
"2","17533","2.002","something terrible!"
"3","17534","3.003","I'm an oil man"

Then you can run that exact same command:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

If you query your Athena table with select * from test, you will get:

  i       d          f              s           
 --- ------------ ------- --------------------- 
  1   2018-01-01   1.001   something great!     
  2   2018-01-02   2.002   something terrible!  
  3   2018-01-03   3.003   I'm an oil man    

An analogous problem also compromises the explanation on TIMESTAMP in the aforementioned documentation:

[OpenCSVSerDe] recognizes the TIMESTAMP type if it is specified in the UNIX format, such as yyyy-mm-dd hh:mm:ss[.f...], as the type LONG.

It seems to indicate that we should format TIMESTAMPs as yyyy-mm-dd hh:mm:ss[.f...]. Not really. In fact, we need to use UNIX Epoch Time again, but this time as the number of milliseconds that have elapsed since midnight, 1 January 1970 (UTC).

For instance, consider the following sample CSV:

"i","d","f","s","t"
"1","17532","1.001","something great!","1564286638027"
"2","17533","2.002","something terrible!","1564486638027"
"3","17534","3.003","I'm an oil man","1563486638012"

And the following CREATE TABLE statement:

CREATE EXTERNAL TABLE test (i int, d date, f  float, s string, t timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

This will be the result set for select * from test:

  i       d          f              s                       t             
 --- ------------ ------- --------------------- ------------------------- 
  1   2018-01-01   1.001   something great!      2019-07-28 04:03:58.027  
  2   2018-01-02   2.002   something terrible!   2019-07-30 11:37:18.027  
  3   2018-01-03   3.003   I'm an oil man        2019-07-18 21:50:38.012  
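
As a quick sanity check on those millisecond values, Presto's from_unixtime and to_unixtime functions (both available in Athena, both working in seconds, so multiply or divide by 1000) convert in either direction:

SELECT from_unixtime(1564286638027 / 1000.0);
-- 2019-07-28 04:03:58.027

SELECT CAST(round(to_unixtime(TIMESTAMP '2019-07-28 04:03:58.027') * 1000) AS bigint);
-- 1564286638027
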
answered Sep 22 '22 by Alexandre


One way around this is to declare the d column as a string and then, in the select query, use DATE(d) or date_parse to convert the value to the date data type.
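
A minimal sketch of that workaround (the table name test_str is just for illustration; DATE() and date_parse are Presto functions that Athena supports):

CREATE EXTERNAL TABLE test_str (i int, d string, f float, s string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/test/'
TBLPROPERTIES ("skip.header.line.count"="1");

SELECT i,
       DATE(d) AS d_cast,                            -- cast the ISO string directly to DATE
       DATE(date_parse(d, '%Y-%m-%d')) AS d_parsed,  -- or parse explicitly and take the date part
       f,
       s
FROM test_str;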

answered Sep 21 '22 by Tanveer Uddin