Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Tool to "Data Load" or "ETL" -- from SQL Server into Amazon Redshift

I am trying to figure out decent but simple tool which I can host myself in AWS EC2, which will allow me to pull data out of SQL Server 2005 and push to Amazon Redshift. I basically have a view in SQL Server on which I am doing SELECT * and I need just put all this data into Redshift. The biggest concern is that there is a lot of data, and this will need to be configurable so I can queue it, run as a nighly/continuous job, etc.

Any suggestions?

like image 651
alexeypro Avatar asked Dec 13 '25 17:12

alexeypro


2 Answers

alexeypro,

dump tables to files, then you have two fundamental challenges to solve:

  1. Transporting data to Amazon
  2. Loading data to Redshift tables.

Amazon S3 will help you with both:

  1. S3 supports fast upload of files to Amazon from your SQL server location. See this great article. It is from 2011 but I did some testing a few months back and saw very similar results. I was testing with gigabytes of data and 16 uploader threads were ok, as I'm not on backbone. Key thing to remember is that compression and parallel upload are your friends to cut down the time for upload.

  2. Once data are on S3, Redshift supports high-performance parallel load from files on S3 to table(s) via COPY SQL command. To get fastest load performance pre-partition your data based on table distribution key and and pre-sort it to avoid expensive vacuums. All is well documented in Amazon's best practices. I have to say these guys know how to make things neat & simple, so just follow the steps.

If you are coder you can orchestrate the whole process remotely using scripts in whatever shell/language you want. You'll need tools/libraries for parallel HTTP upload to S3 and command line access to Redshift (psql) to launch the COPY command.

Another options is Java, there are libraries for S3 upload and JDBC access to Redshift.

As other posters suggest, you could probably use SSIS (or essentially any other ETL tool) as well. I was testing with CloverETL. Took care of automating the process as well as partitioning/presorting the files for load.

like image 104
Michal Avatar answered Dec 15 '25 10:12

Michal


Now Microsoft released SSIS Powerpack, so you can do it natively.

SSIS Amazon Redshift Data Transfer Task

Very fast bulk copy from on-premises data to Amazon Redshift in few clicks
Load data to Amazon Redshift from traditional DB engines like SQL Server, Oracle, MySQL, DB2
Load data to Amazon Redshift from Flat Files
Automatic file archiving support
Automatic file compression support to reduce bandwidth and cost
Rich error handling and logging support to troubleshoot Redshift Datawarehouse loading issues
Support for SQL Server 2005, 2008, 2012, 2014 (32 bit and 64 bit) 

Why SSIS PowerPack?

High performance suite of Custom SSIS tasks, transforms and adapters 
like image 24
testmeok Avatar answered Dec 15 '25 09:12

testmeok



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!