Amazon Redshift doesn't really have a CSV option for UNLOAD (i.e., nothing like the WITH CSV available in PostgreSQL's COPY command).
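For comparison, here is a minimal sketch of the PostgreSQL side; the table name and output path are placeholders:

-- PostgreSQL, not Redshift: COPY can write RFC-4180 CSV directly.
-- Table name and output path are placeholders.
COPY unload_test TO '/tmp/unload_test.csv' WITH CSV;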
If you use the ESCAPE option, Redshift escapes double quotes with a backslash (\). Excel, for example, doesn't know what to do with that backslash. It should really be doubling the quote instead, per RFC-4180:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:
"aaa","b""bb","ccc"
Sooo... quotes are going to break exports unless we can figure out a way to make Redshift escape them properly.
Given the following data:
# select * from unload_test;
test_field | test_field_number
--------------------------------------------+-------------------
"No," she said, "that's not how we do it." | 5
Hi! This is a test, yo. | 5
If we use ESCAPE but not ADDQUOTES, Excel et al. still break columns on every comma, including the "escaped" ones (\,).
If we use ESCAPE and ADDQUOTES, you get the result shown in the initial issue example.
If we use ADDQUOTES but not ESCAPE, parsing breaks on the double quotes in the data.
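For reference, the ESCAPE plus ADDQUOTES attempt above would look roughly like this; the bucket, prefix, and IAM role are placeholders, not the actual ones used:

-- Rough sketch of the ESCAPE + ADDQUOTES attempt; bucket, prefix,
-- and IAM role are placeholders.
unload ('select * from unload_test')
to 's3://mybucket/unload_test_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ','
addquotes
escape;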
Is there an available work-around for this that will accommodate both quotes and commas in the data?
Unload VENUE to a CSV file

The following example unloads the VENUE table and writes the data in CSV format to s3://mybucket/unload/.

unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV;
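If the goal is a single, Excel-friendly file, the CSV option can also be combined with other UNLOAD options; HEADER and PARALLEL OFF below are illustrative choices, and the bucket and role are placeholders:

-- Illustrative only: CSV with a header row, written serially rather than
-- one file per slice. Bucket and IAM role are placeholders.
unload ('select * from venue')
to 's3://mybucket/unload/venue_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV
HEADER
PARALLEL OFF;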
Yes, because that's the naming convention UNLOAD uses to avoid duplicate names, which I believe cannot be avoided from our end.
To unload data from database tables to a set of files in an Amazon S3 bucket, you can use the UNLOAD command with a SELECT statement. You can unload text data in either delimited format or fixed-width format, regardless of the data format that was used to load it.
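As a sketch of the two formats mentioned, assuming a hypothetical bucket, role, and column widths:

-- Delimited output (the delimiter defaults to a pipe if none is given).
unload ('select * from venue')
to 's3://mybucket/unload/venue_pipe_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|';

-- Fixed-width output; every selected column needs an entry in the width spec.
unload ('select venueid, venuename from venue')
to 's3://mybucket/unload/venue_fixed_'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
fixedwidth 'venueid:8,venuename:100';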
A solution that scales was to post-process the contents with perl/sed and convert them to proper CSV. These UNLOAD flags are required: DELIMITER ',' ADDQUOTES ESCAPE.
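The corresponding UNLOAD would look something like this; the table name, bucket, and role below are placeholders, not the original ones:

-- Unload with the three required flags; the output still needs the
-- perl/sed post-processing shown next. All names here are placeholders.
unload ('select * from my_table')
to 's3://mybucket/non_csv/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter ','
addquotes
escape;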
aws s3 cp s3://mybucket/non_csv/part_0000 - |
  # Unescapes \" into "" but only when an odd number of backslashes is found looking backwards
  perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' |
  # Removes \ but only when an odd number of backslashes is found looking backwards
  perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' |
  # Unescapes \\ into a single backslash
  sed -e 's/\\\\/\\/g' |
  aws s3 cp - s3://mybucket/csv/part_0000
I tested this against various edge cases on real, noisy data and it produces correct results. All backslashes are removed because, with the added quoting, nothing needs to be escaped except " (which becomes "").
Example test:
$ echo '"","\"\"Hi\\\\\"","\\\\","\<tab>"' | perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | sed -e 's/\\\\/\\/g'
Output:
"","""""Hi\\""","\\","<tab>"
Perl (or another advanced regexp engine) is required to do a correct look-behind.
Performance was not an issue because I was limited by network bandwidth rather than CPU. However, one can use GNU parallel to process all output files at once.
However, I did not test the aws s3 cp part, because I use another tool to get the S3 contents into STDIN.
Cheers!