I have created a table in AWS Athena like this:
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '\"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://bucket/test/'
In the bucket I put a simple CSV file with the following context:
rec1 col1,rec2 col2
rec2 col1,"rec2, col2"
rec3 col1,"rec3
col2"
When I run data preview request SELECT * FROM "default"."test_line_breaks" limit 10;
then Athena returns the following response:
How should I set ROW FORMAT
to properly handle line breaks within the field values? So that rec3\ncol2
appears in col2
.
The problem here is that the OpenCSV Serializer-Deserializer
Does not support embedded line breaks in CSV files.
See this documentation from AWS.
However, it might be possible to use RegexSerDe. Just remember that this Deserializer will take "Java Flavored" Regex. So be sure to use an online Regex tool that supports that syntax in your debugging.
Edit: Still working on the syntax for dealing with the embedded line feed \n
. However, here is a sample that handles two columns with optional quotes. The following regex "*([^"]*)"*,"*([^"]*)"*
worked on your line with the embedded return carriage. However, I think the Presto Engine is only feeding it rec3 col1,"rec3
. I continue working on it.
CREATE EXTERNAL TABLE IF NOT EXISTS default.test_line_breaks (
col1 string,
col2 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = '"*([^"]*)"*,"*([^"]*)"*'
)
STORED AS TEXTFILE
LOCATION 's3://.../47936191';
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With