I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number at the end of the file name (ideally it would be eliminated completely), and I want to add a file extension at the end of the file name. I'd like to see either of the following:
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As the Amazon Redshift UNLOAD documentation says, if you do not want the output split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave parallel unloading enabled. Even then, the file name will always include the 000.[EXT] suffix (the .[EXT] part is present only when compression is enabled), because there is a limit on the size of a file that Redshift can write, as the documentation says:
By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will always add at least the suffix 000, because Redshift doesn't know in advance how large the output will be, so it adds this suffix in case the output exceeds the 6.2 GB limit and has to continue in further files.
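The naming rule described above can be sketched in a few lines of Python. This is only an illustrative model, not Redshift's implementation: the helper name `unload_file_names` is hypothetical, the 6.2 GB cap is taken from the documentation quote, and the `.gz` extension assumes GZIP compression as in the question.

```python
import math

# Assumption: per-file size cap from the documentation quoted above.
MAX_FILE_GB = 6.2


def unload_file_names(prefix, total_gb, gzip=False):
    """Hypothetical helper: model the object names of a serial
    (PARALLEL OFF) unload of total_gb gigabytes written to prefix."""
    # One file per started 6.2 GB chunk of output.
    parts = math.ceil(total_gb / MAX_FILE_GB)
    # The extension is only present when compression is enabled.
    ext = ".gz" if gzip else ""
    # Part numbers are zero-padded to three digits and appended to the prefix.
    return [f"{prefix}{i:03d}{ext}" for i in range(parts)]


# The 13.4 GB example from the documentation: three parts.
print(unload_file_names("s3://mybucket/key", 13.4))
# A small gzip-compressed unload still gets the 000 part number.
print(unload_file_names("s3://mybucket/foo", 0.5, gzip=True))
```

Under this model the part number is always appended to whatever prefix you pass in TO, so the only part of the name you control is the prefix itself.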
If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:
When PARALLEL is TRUE, UNLOAD writes one or more files per slice, so you get at least as many files as the cluster has slices (the documentation quoted above ties the file count to slices, not nodes). The data is written directly from the data nodes themselves, which is much faster because the work is done in parallel and skips the leader node.
COPY and UNLOAD work directly with the data nodes, so they behave in much the same way as an unload with PARALLEL TRUE. In contrast, queries such as SELECT, UPDATE, DELETE and INSERT are processed by the leader node, which is why they suffer when the leader node is under load.