Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read quoted CSV with NULL values into Amazon Athena

I'm trying to create an external table in Athena using quoted CSV file stored on S3. The problem is, that my CSV contain missing values in columns that should be read as INTs. Simple example:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"

CREATE TABLE DEFINITION:

CREATE EXTERNAL TABLE schema.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT 
SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ",",
'quoteChar' = '"',
'skip.header.line.count' = '1'
)
STORED AS TEXTFILE
LOCATION 's3://mybucket/test_null/unquoted/'

CREATE TABLE statement runs fine but as soon as I try to query the table, I'm getting HIVE_BAD_DATA: Error parsing field value ''.

I tried making the CSV look like this (quote empty string):

"id","height","age","name"
1,"",26,"Adam"
2,178,28,"Robert"

But it's not working.

Tried specifying 'serialization.null.format' = '' in SERDEPROPERTIES - not working.

Tried specifying the same via TBLPROPERTIES ('serialization.null.format'='') - still nothing.

It works, when you specify all columns as STRING but that's not what I need.

Therefore, the question is, is there any way to read a quoted CSV (quoting is important as my real data is much more complex) to Athena with correct column specification?

like image 793
Mikolaj Avatar asked Jun 06 '18 15:06

Mikolaj


2 Answers

Unfortunately there is no way to get both support for quoted fields and support for null values in Athena. You have to choose either or.

You can use OpenCSVSerDe and type all columns as string, that will give you support for quoted fields, and emtpty strings for empty fields. Cast values at query time using TRY_CAST or CASE/WHEN.

Or you can use LazySimpleSerDe and strip quotes at query time.

I would go for OpenCSVSerDe because you can always create a view with all the type conversion and use the view for your regular queries.

You can read all the nitty-gritty details of working with CSV in Athena here: The Athena Guide: Working with CSV

like image 101
Theo Avatar answered Oct 31 '22 11:10

Theo


Quick and dirty way to handle these data:

CSV:

id,height,age,name
1,,26,"Adam"
2,178,28,"Robert"
3,123,34,"Bill, Comma"
4,183,38,"Alex"

DDL:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n' -- Or use Windows Line Endings
LOCATION 's3://XXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1')
;

The issue is that it is not handling the quote characters in the last field. Based on the documentation provided by AWS, this makes sense as the LazySimpleSerDe given the following from Hive.

I suspect the solution is using the following SerDe org.apache.hadoop.hive.serde2.RegexSerDe.

I will work on the regex later.

Edit:

Regex as promised:

CREATE EXTERNAL TABLE stackoverflow.test_null_unquoted (
  id INT,
  height INT,
  age INT,
  name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.*),(.*),(.*),\"(.*)\""
)
LOCATION 's3://XXXXXXXXXXXXXXX/'
TBLPROPERTIES ('skip.header.line.count'='1') -- Does not appear to work
;

enter image description here

Note: RegexSerDe did not seem to work properly with TBLPROPERTIES ('skip.header.line.count'='1'). That could be due to the Hive version used by Athena or the SerDe. In your case, you can likely just exclude rows where ID IS NULL.

Further Reading:

Stackoverflow - remove surrounding quotes from fields while loading data into hive

Athena - OpenCSVSerDe for Processing CSV

like image 5
Zerodf Avatar answered Oct 31 '22 11:10

Zerodf