Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Amazon Redshift: Copying Data Between Databases

I am looking to Copy data within databases on Amazon Redshift. Before this, I was copying data from a Redshift database to a PostgreSQL hosted on an EC2 instance for analytical purpose. I had ruby script that would do it using dblink EXTENSION.

But now since the data is un-managable on PostgreSQL instance, we have decided to copy the data to a separate database, on the same Redshift cluster.

Bump! What I found was that dblink is not available for Postgre 8.0 (Redshift Version). Is there a way I can get this working on Redshift.

I don't want to use the Copy command and add a layer of S3 in between or even data-pipeline. You see, it's the same cluster, just different databases.

In case, there is no other way, shall I use data-pipeline or Copy to S3 and then back to Redshift. Or is it essentially the same in the end?

P.S. Any help/leads would be appreciated. I would do my hard work. Just need a heads up.

like image 447
Sambhav Sharma Avatar asked Jun 01 '15 12:06

Sambhav Sharma


1 Answers

This seems now possible (since June 3, 2016)

See: https://blogs.aws.amazon.com/bigdata/post/Tx1GQ6WLEWVJ1OX/JOIN-Amazon-Redshift-AND-Amazon-RDS-PostgreSQL-WITH-dblink

In the above article, they give an example to copy from redshift to an other database:

I'll annotate with (postgres cluster) and (redshift cluster) for clarity

Copying data from Amazon Redshift to RDS PostgreSQL

The SQL to create the table: (postgres cluster)

CREATE TABLE sales_summary (
   saletime timestamp,
   sellerid int,
   sales decimal
);

[...]

The following query shows how to insert data into the tables using the dblink function:

copy is from (redshift cluster) to (postgres cluster)

INSERT INTO sales_summary
SELECT *
FROM dblink('foreign_server',$REDSHIFT$
    SELECT date_trunc('hours', saletime) AS ts, sellerid, sum(pricepaid) sales
    FROM sales 
    WHERE saletime >= '2008-01-01'
    AND saletime < '2008-02-01'
    GROUP BY ts, sellerid 
    ORDER BY sales 
$REDSHIFT$) AS t1 (saletime timestamp, sellerid int, sales decimal);

Edit: I've used this in production to load tables with many millions of rows daily, and I haven't had a single issue related to it. So I would recommend it as the preferred method for loading from Redshift to RDS-postgres (As opposed to COPYing, or worse, loading the data in memory with an ORM)

like image 78
Benjamin Crouzier Avatar answered Sep 17 '22 12:09

Benjamin Crouzier