Export big data from PostgreSQL to AWS S3

I have ~10 TB of data in a PostgreSQL database. I need to export this data into an AWS S3 bucket.

I know how to export to a local file, for example:

\connect DATABASE_NAME
COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS) TO 'CUSTOMERS_DATA.CSV' WITH DELIMITER '|' CSV;

but I don't have a local drive with 10 TB of space.

How can I export directly to an AWS S3 bucket?

asked Oct 28 '18 by alexanoid

2 Answers

When exporting a large data dump, your biggest concern should be mitigating failures. Even if you could saturate a gigabit network connection, moving 10 TB of data will take more than 24 hours. You don't want to have to restart that due to a failure (such as a database connection timeout).

This implies that you should break the export into multiple pieces. You can do this by adding an ID range to the select statement inside the copy (I've just edited your example, so there may be errors):

COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS WHERE ID BETWEEN 0 AND 1000000) TO 'CUSTOMERS_DATA_0.CSV' WITH DELIMITER '|' CSV;

You would, of course, generate these statements with a short program; don't forget to change the name of the output file for each one. I recommend picking an ID range that gives you a gigabyte or so per output file, resulting in 10,000 intermediate files.
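
For example, here is a rough sketch of such a generator in bash; the maximum ID, chunk size, and output directory are assumptions you'd adjust so each file lands near 1 GB:

#!/usr/bin/env bash
# Sketch: emit one COPY statement per ID range into export_statements.sql.
# MAX_ID, CHUNK, and the output directory are placeholders; tune CHUNK so
# each output file comes out around 1 GB.
MAX_ID=10000000000
CHUNK=1000000

i=0
for ((start = 0; start < MAX_ID; start += CHUNK)); do
  end=$((start + CHUNK - 1))
  echo "COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS WHERE ID BETWEEN ${start} AND ${end}) TO '/mnt/export/CUSTOMERS_DATA_${i}.CSV' WITH DELIMITER '|' CSV;"
  i=$((i + 1))
done > export_statements.sql

You can then feed export_statements.sql to psql in whatever batches you like, and re-run only the chunks that fail.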

Where you write these files is up to you. If S3FS is sufficiently reliable, I think it's a good idea.

By breaking the unload into multiple smaller pieces, you can also divide it among multiple EC2 instances, although you'll probably saturate the database machine's bandwidth with only a few readers. Also be aware that AWS charges $0.01 per GB for cross-AZ data transfer (with 10 TB that's $100), so make sure these EC2 machines are in the same AZ as the database machine.

It also means that you can perform the unload while the database is not otherwise busy (i.e., outside of normal working hours).

Lastly, it means that you can test your process, and you can fix any data errors without having to run the entire export (or process 10TB of data for each fix).

On the import side, Redshift can load multiple files in parallel. This should improve your overall time, although I can't really say how much.

One caveat: use a manifest file rather than an object name prefix. I've run into cases where S3's eventual consistency caused files to be dropped during a load.
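
To make that concrete, here is a rough sketch (the bucket, target table, cluster endpoint, and IAM role ARN are placeholders, not values from anything above). A Redshift manifest is just a small JSON file naming the exact objects to load, and the COPY points at the manifest instead of a key prefix:

# Sketch only: bucket, table, cluster endpoint, and IAM role ARN are placeholders.
# 1) Write and upload a manifest listing the exact objects to load.
cat > manifest.json <<'EOF'
{
  "entries": [
    {"url": "s3://some-bucket/CUSTOMERS_DATA_0.CSV", "mandatory": true},
    {"url": "s3://some-bucket/CUSTOMERS_DATA_1.CSV", "mandatory": true}
  ]
}
EOF
aws s3 cp manifest.json s3://some-bucket/manifest.json

# 2) Load via the manifest; Redshift speaks the PostgreSQL protocol, so psql works here too.
psql -h my-cluster.abc123.us-east-1.redshift.amazonaws.com -p 5439 -U admin -d warehouse -c \
  "COPY CUSTOMERS FROM 's3://some-bucket/manifest.json' IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load' MANIFEST CSV DELIMITER '|';"

Marking entries as mandatory makes the load fail loudly if an object is missing, which is exactly the failure mode the prefix approach can hide.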

answered Oct 22 '22 by kdgregory

You can pipe the output of a program straight to S3, like so:

cat "hello world" | aws s3 cp - s3://some-bucket/hello.txt

I'm not massively experienced with PostgreSQL, but from what I understand the following should work:

psql -U user -d DATABASE_NAME -c "Copy (Select ID, NAME, ADDRESS From CUSTOMERS) To STDOUT With CSV HEADER DELIMITER ',';" | gzip | aws s3 cp - s3://some-bucket/CUSTOMERS_DATA.csv.gz
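
For the full 10 TB you'd probably combine this with the ID-range chunking from the first answer rather than pushing one giant stream. Here is a rough sketch of a single chunk (the user, database, bucket, and ID range are placeholders; wrap it in a loop over ranges). If you do ever stream something much larger than about 50 GB in one go, aws s3 cp also accepts an --expected-size hint so it can size the multipart upload.

# Sketch of one chunk: stream an ID range straight to S3, compressed.
# User, database, bucket, and ID range are placeholders; loop over ranges to cover the rest.
psql -U user -d DATABASE_NAME -c \
  "COPY (SELECT ID, NAME, ADDRESS FROM CUSTOMERS WHERE ID BETWEEN 0 AND 999999) TO STDOUT WITH CSV DELIMITER '|';" \
  | gzip \
  | aws s3 cp - s3://some-bucket/CUSTOMERS_DATA_0.csv.gz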
answered Oct 22 '22 by thomasmichaelwallace