
How to unload a table on RedShift to a single CSV file?

I want to migrate a table from Amazon Redshift to MySQL, but using "unload" generates multiple data files, which are hard to import into MySQL directly.

Is there any approach to unload the table to a single CSV file so that I can import it to MySQL directly?

asked Dec 02 '13 by ciphor


2 Answers

To unload to a single file, use PARALLEL OFF:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off;

I also recommend using GZIP to make the file even smaller for download:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials 
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off
gzip;
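
If the unload does spill over into multiple files despite PARALLEL OFF, the MANIFEST option writes a JSON file listing every part so none are missed on download. A sketch using the same placeholder bucket and credentials as above:

unload ('select * from venue')
to 's3://mybucket/tickit/unload/venue_' credentials
'aws_access_key_id=<access-key-id>;aws_secret_access_key=<secret-access-key>'
parallel off
gzip
manifest;

The manifest is written to the same prefix, with "manifest" appended to the object name.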
answered Oct 24 '22 by Dan Ciborowski - MSFT


This is an old question at this point, but I feel like all the existing answers are slightly misleading. If your question is, "Can I absolutely 100% guarantee that Redshift will ALWAYS unload to a SINGLE file in S3?", the answer is simply NO.

That being said, for most cases, you can generally limit your query in such a way that you'll end up with a single file. Per the documentation (https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html), the main factor in limiting the number of files you generate is the actual raw size in bytes of your export (NOT the number of rows). The limit on the size of an output file generated by the Redshift UNLOAD command is 6.2GB.

So if you want to try to guarantee that you get a single output file from UNLOAD, here's what you should try:

  • Specify PARALLEL OFF. Parallel is "ON" by default and will generally write to multiple files unless you have a tiny cluster (The number of output files with "PARALLEL ON" set is proportional to the number of slices in your cluster). PARALLEL OFF will write files serially to S3 instead of in parallel and will only spill over to using multiple files if you exceed the size limit.
  • Limit the size of your output. The raw size of the data must be less than 6.2GB if you want a single file, so make your query's WHERE clause more restrictive or use a LIMIT clause to keep the record count down. Unfortunately, neither technique is perfect, since rows can be of variable size. It's also unclear whether the 6.2GB spillover limit applies to the pre-GZIP or the post-GZIP size.
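
One way to sanity-check the size ahead of time is to query the SVV_TABLE_INFO system view. Note that its size column reports the compressed on-disk size in 1 MB blocks, not the raw export size, so treat it only as a rough indicator (the table name here is a placeholder):

select "table", size as size_mb, tbl_rows
from svv_table_info
where "table" = '<table>';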

For me, the UNLOAD command that ended up generating a single CSV file in most cases was:

UNLOAD
('SELECT <fields> FROM <table> WHERE <restrict_query>')
TO 's3://<bucket_name>/<filename_prefix>'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
DELIMITER AS ','
ADDQUOTES
NULL AS ''
PARALLEL OFF;

The other nice side effect of PARALLEL OFF is that it will respect your ORDER BY clause if you have one and generate the files in an order that keeps all the records ordered, even across multiple output files.
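
So if you need ordered output, put the ORDER BY directly in the UNLOAD query. A sketch along the lines of the command above, with a hypothetical sort column:

UNLOAD
('SELECT <fields> FROM <table> WHERE <restrict_query> ORDER BY <sort_column>')
TO 's3://<bucket_name>/<filename_prefix>'
CREDENTIALS 'aws_access_key_id=<access_key>;aws_secret_access_key=<secret_key>'
DELIMITER AS ','
ADDQUOTES
NULL AS ''
PARALLEL OFF;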

Addendum: There seems to be some folkloric knowledge around using LIMIT 2147483647 to force the leader node to do all the processing and generate a single output file, but this doesn't seem to be actually documented anywhere in the Redshift documentation and as such, relying on it seems like a bad idea since it could change at any time.

answered Oct 24 '22 by Brent Writes Code