This answer suggests using AWS Data Pipeline but I'm wondering if there's a clever way to do it with my own machine and Python.
I've been using the psycopg2, boto3, and pandas libraries. Tables have 5 to 50 columns and a few million rows. My current method doesn't work that well with large data.
I guess I can show one of my own versions here as well, which is based on copy_expert in psycopg2:
import io
import psycopg2
import boto3

resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)
cur = conn.cursor()

def copyFun(bucket, select_query, filename):
    # select_query can be a table name or a parenthesized SELECT statement;
    # COPY ... TO STDOUT streams the result back to the client as CSV.
    query = f"""COPY {select_query} TO STDOUT \
        WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    file = io.StringIO()
    cur.copy_expert(query, file)
    # Note: the whole CSV sits in memory in the StringIO before the upload.
    resource.Object(bucket, f'{filename}.csv').put(Body=file.getvalue())
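One caveat with the snippet above: the whole result set is buffered in the StringIO before the upload, so memory becomes the limit for tables with millions of rows. A possible lower-memory variant (just a sketch; copy_to_s3 and the temp-file handling are my own naming, and the connection placeholders are the same as above) is to let COPY write to a temporary file on disk and then hand that file to boto3's managed upload, which handles multipart uploads for large objects:

import tempfile
import psycopg2
import boto3

resource = boto3.resource('s3')
conn = psycopg2.connect(dbname=db, user=user, password=pw, host=host)

def copy_to_s3(bucket, select_query, filename):
    # Same COPY ... TO STDOUT as above, but spooled to a temp file on disk
    # instead of an in-memory StringIO.
    query = f"""COPY {select_query} TO STDOUT \
        WITH (FORMAT csv, DELIMITER ',', QUOTE '"', HEADER TRUE)"""
    with tempfile.NamedTemporaryFile(mode='w+', suffix='.csv') as tmp:
        with conn.cursor() as cur:
            cur.copy_expert(query, tmp)  # server streams rows into the temp file
        tmp.flush()
        # upload_file is a managed transfer: big files become multipart uploads
        # (re-opening the temp file by name works on Linux, e.g. an EC2 box).
        resource.Object(bucket, f'{filename}.csv').upload_file(tmp.name)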
We do the following in our case, and performance-wise it's pretty fast. It's a scheduled method rather than continuous streaming. I'm not 100% sure it's the wisest method, but it's definitely good from a speed perspective for scheduled data exports in CSV format that we eventually use for loading into our data warehouse.
Using a shell script, we fire a psql command to copy the data to a local file on the EC2 app instance:
psql [your connection options go here] -F, -A -c 'select * from my_schema.example' >example.csv
Then, using a shell script, we fire an s3cmd command to put example.csv to the designated S3 bucket location:
s3cmd put example.csv s3://your-bucket/path/to/file/
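If you would rather keep the upload step in Python (the question already pulls in boto3), a managed transfer does roughly what s3cmd put does here; the bucket and key below are just placeholders:

import boto3

s3 = boto3.resource('s3')
# Managed transfer: boto3 switches to multipart uploads for large files.
s3.Object('your-bucket', 'path/to/file/example.csv').upload_file('example.csv')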