What's the best way to create RFC-4180-friendly CSV files from Amazon Redshift UNLOAD?

Amazon Redshift doesn't really have a CSV option when using UNLOAD (i.e., nothing similar to the WITH CSV option available in PostgreSQL's COPY command).

If you use the ESCAPE option, Redshift escapes double quotes with a \. Excel, for example, doesn't know what to do with that backslash. It should really be a doubled double quote, per RFC-4180:

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote. For example:

"aaa","b""bb","ccc"

Sooo... quotes are going to break exports unless we can figure out a way to make Redshift escape them properly.
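To make the difference concrete, here is the RFC sample field written both ways. The second line is only an illustration of the backslash style described above, not verbatim UNLOAD output:

# RFC-4180 style, which Excel and most CSV parsers expect:
"aaa","b""bb","ccc"
# Backslash style, roughly what Redshift's ESCAPE option emits:
"aaa","b\"bb","ccc"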

Given the following data:

# select * from unload_test;
                 test_field                 | test_field_number
 --------------------------------------------+-------------------
 "No," she said, "that's not how we do it." |                 5
 Hi! This is a test, yo.                    |                 5
  • If we use ESCAPE but no ADDQUOTES, Excel et al. still break columns on every comma, including the "escaped" ones (\,).
  • If we use both ESCAPE and ADDQUOTES, you get the result shown in the initial issue example above.
  • If we use ADDQUOTES but not ESCAPE, parsing breaks on the double quotes in the data.
  • Of course, if we use neither, we break on both commas and quotes.

Is there an available work-around for this that will accommodate both quotes and commas in the data?
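(For anyone who wants to reproduce the comparison, a minimal sketch of the test table is below. The connection details and column types are assumptions; only the two rows come from the data shown above.)

# Hypothetical reproduction of the sample table; endpoint, port, and database are placeholders.
psql -h mycluster.example.com -p 5439 -d mydb <<'SQL'
CREATE TABLE unload_test (test_field varchar(100), test_field_number int);
INSERT INTO unload_test VALUES
  ('"No," she said, "that''s not how we do it."', 5),
  ('Hi! This is a test, yo.', 5);
SQL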

Brian D. asked Jan 28 '16

People also ask

How do you unload data from Redshift to S3 in CSV?

Unload VENUE to a CSV file: the following example unloads the VENUE table and writes the data in CSV format to s3://mybucket/unload/.

unload ('select * from venue')
to 's3://mybucket/unload/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
CSV;

Can you remove 000 file extension after a CSV unload from Amazon Redshift to S3?

That's the naming convention UNLOAD uses in order to avoid duplicate file names, which I believe cannot be avoided from our end.

How do you unload data in Redshift?

To unload data from database tables to a set of files in an Amazon S3 bucket, you can use the UNLOAD command with a SELECT statement. You can unload text data in either delimited format or fixed-width format, regardless of the data format that was used to load it.


1 Answer

My scalable solution was to post-process the unloaded contents with perl/sed and convert them into proper CSV.

These flags are required: DELIMITER ',' ADDQUOTES ESCAPE.
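For context, here is a hedged sketch of the UNLOAD invocation that would produce that escaped, non-CSV output. The cluster endpoint, database, bucket, and IAM role are placeholders, not from the original answer:

# Sketch only: Redshift appends its own part numbering to the TO prefix.
psql -h mycluster.example.com -p 5439 -d mydb -c "
UNLOAD ('select * from unload_test')
TO 's3://mybucket/non_csv/'
IAM_ROLE 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
DELIMITER ',' ADDQUOTES ESCAPE;"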

aws s3 cp s3://mybucket/non_csv/part_0000 - | \
# Unescapes \" into "" but only when odd number of backslashes are found backwards
perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | \
# Removes \ but only when odd number of backslashes are found backwards
perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | \
# Unescapes \\ into \
sed -e 's/\\\\/\\/g' | \
aws s3 cp - s3://mybucket/csv/part_0000

I tested this on various edge cases against real, noisy data, and it produces correct results. All backslashes are removed, because in the final CSV nothing needs to be escaped except " -> "" inside the added quotes.

Example test:

$ echo '"","\"\"Hi\\\\\"","\\\\","\<tab>"' | perl -pe 's/((?<![\\])(\\\\)*)\\"/\1""/g' | perl -pe 's/((?<![\\])(\\\\)*)\\/\1/g' | sed -e 's/\\\\/\\/g'

Output:

"","""""Hi\\""","\\","<tab>"

Perl, or another regexp engine with proper look-behind support, is required to do the look-behind correctly.

Performance was not an issue, because I hit network bandwidth limits before CPU limits. However, one can use GNU parallel to process all of the output files at once (a rough sketch is below).

Note that I did not test the aws s3 cp parts, because I use another tool to get S3 contents into STDIN.
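A rough sketch of that GNU parallel idea follows. The bucket layout matches the commands above, but fix_csv.sh is a hypothetical wrapper around the perl/sed pipeline:

# List every unloaded part and convert 8 of them at a time.
# fix_csv.sh is assumed to read the escaped data on stdin and write RFC-4180 CSV to stdout.
aws s3 ls s3://mybucket/non_csv/ | awk '{print $4}' | \
parallel -j 8 "aws s3 cp s3://mybucket/non_csv/{} - | ./fix_csv.sh | aws s3 cp - s3://mybucket/csv/{}"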

Cheers!

TautrimasPajarskas answered Oct 10 '22