What's the best way to create RFC-4180-friendly CSV files from Amazon Redshift UNLOAD?

Amazon Redshift doesn't really have a CSV option when using UNLOAD (i.e., nothing similar to the WITH CSV option available in PostgreSQL's COPY command).

If you use the ESCAPE option, Redshift escapes double quotes with a \. Excel, for example, doesn't know what to do with that backslash. It should really be a doubled double quote, per RFC-4180:

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

"aaa","b""bb","ccc"

Sooo... quotes are going to break exports unless we can figure out a way to make Redshift escape them properly.
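To make the difference concrete, here is the RFC sample field written both ways. The second line is only an illustration of the backslash style described above, not verbatim UNLOAD output:

# RFC-4180 style, which Excel and most CSV parsers expect:
"aaa","b""bb","ccc"
# Backslash style, roughly what Redshift's ESCAPE option emits:
"aaa","b\"bb","ccc"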

Given the following data:

# select * from unload_test;
                 test_field                 | test_field_number
 --------------------------------------------+-------------------
 "No," she said, "that's not how we do it." |                 5
 Hi! This is a test, yo.                    |                 5
  • If we use ESCAPE but no ADDQUOTES, Excel et al. still break columns on every comma, including the "escaped" ones (\,).
  • If we use both ESCAPE and ADDQUOTES, you get the result shown in the initial issue example above.
  • If we use ADDQUOTES but not ESCAPE, parsing breaks on the double quotes in the data.
  • Of course, if we use neither, we break on both commas and quotes.

Is there an available work-around for this that will accommodate both quotes and commas in the data?
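(For anyone who wants to reproduce the comparison, a minimal sketch of the test table is below. The connection details and column types are assumptions; only the two rows come from the data shown above.)

# Hypothetical reproduction of the sample table; endpoint, port, and database are placeholders.
psql -h mycluster.example.com -p 5439 -d mydb <<'SQL'
CREATE TABLE unload_test (test_field varchar(100), test_field_number int);
INSERT INTO unload_test VALUES
  ('"No," she said, "that''s not how we do it."', 5),
  ('Hi! This is a test, yo.', 5);
SQL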

Brian D. asked Jan 28 '16

People also ask

How do you unload data from Redshift to S3 in CSV?

Unload VENUE to a CSV file: the following example unloads the VENUE table and writes the data in CSV format to s3://mybucket/unload/.

unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV;

Can you remove 000 file extension after a CSV unload from Amazon Redshift to S3?

That's the naming convention UNLOAD uses in order to avoid duplicate file names, which I believe cannot be avoided from our end.

How do you unload data in Redshift?

To unload data from database tables to a set of files in an Amazon S3 bucket, you can use the UNLOAD command with a SELECT statement. You can unload text data in either delimited format or fixed-width format, regardless of the data format that was used to load it.


1 Answer

My scalable solution was to post-process the unloaded contents with perl/sed and convert them into proper CSV.

These flags are required: DELIMITER ',' ADDQUOTES ESCAPE.
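For context, here is a hedged sketch of the UNLOAD invocation that would produce that escaped, non-CSV output. The cluster endpoint, database, bucket, and IAM role are placeholders, not from the original answer:

# Sketch only: Redshift appends its own part numbering to the TO prefix.
psql -h mycluster.example.com -p 5439 -d mydb -c "
UNLOAD ('select * from unload_test')
TO 's3://mybucket/non_csv/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER ',' ADDQUOTES ESCAPE;"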

aws s3 cp s3://mybucket/non_csv/part_0000 - | \
# Unescapes \" into "" but only when odd number of backslashes are found backwards
perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | \
# Removes \ but only when odd number of backslashes are found backwards
perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | \
# Unescapes \\ into \
sed -e 's/\\\\/\\/g' | \
aws s3 cp - s3://mybucket/csv/part_0000

I tested this on various edge cases against real, noisy data, and it produces correct results. All backslashes are removed, because in the final CSV nothing needs to be escaped except " -> "" inside the added quotes.

Example test:

$ echo '"","\"\"Hi\\\\\"","\\\\","\<tab>"' | perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | sed -e 's/\\\\/\\/g'

Output:

"","""""Hi\\""","\\","<tab>"

Perl, or another regexp engine with proper look-behind support, is required to do the look-behind correctly.

Performance was not an issue, because I hit network bandwidth limits before CPU limits. However, one can use GNU parallel to process all of the output files at once (a rough sketch is below).

Note that I did not test the aws s3 cp parts, because I use another tool to get S3 contents into STDIN.
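A rough sketch of that GNU parallel idea follows. The bucket layout matches the commands above, but fix_csv.sh is a hypothetical wrapper around the perl/sed pipeline:

# List every unloaded part and convert 8 of them at a time.
# fix_csv.sh is assumed to read the escaped data on stdin and write RFC-4180 CSV to stdout.
aws s3 ls s3://mybucket/non_csv/ | awk '{print $4}' | \
parallel -j 8 "aws s3 cp s3://mybucket/non_csv/{} - | ./fix_csv.sh | aws s3 cp - s3://mybucket/csv/{}"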

Cheers!

TautrimasPajarskas answered Oct 10 '22