Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle embed line breaks in AWS Athena

I have created a table in AWS Athena like this:

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 'separatorChar' = ',',
 'quoteChar' = '\"',
 'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'

In the bucket I put a simple CSV file with the following context:

rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"

When I run data preview request SELECT * FROM "default"."test_line_breaks" limit 10; then Athena returns the following response: Athena query response

How should I set ROW FORMAT to properly handle line breaks within the field values? So that rec3\ncol2 appears in col2.

like image 617
antklim Avatar asked Dec 22 '17 05:12

antklim


1 Answers

The problem here is that the OpenCSV Serializer-Deserializer

Does not support embedded line breaks in CSV files.

See this documentation from AWS.

However, it might be possible to use RegexSerDe. Just remember that this Deserializer will take "Java Flavored" Regex. So be sure to use an online Regex tool that supports that syntax in your debugging.

Edit: Still working on the syntax for dealing with the embedded line feed \n. However, here is a sample that handles two columns with optional quotes. The following regex "*([^"]*)"*,"*([^"]*)"* worked on your line with the embedded return carriage. However, I think the Presto Engine is only feeding it rec3 col1,"rec3. I continue working on it.

CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
  col1 string, 
  col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';
like image 79
Zerodf Avatar answered Oct 21 '22 10:10

Zerodf