
Near real-time ETL from MySQL to Redshift

I'm trying to figure out the most efficient way to extract data from MySQL, transform it, and load it into Redshift in near real-time. Currently we have an overnight ETL process (using Pentaho) which lasts ~40 minutes, and we want to replace it with a near real-time process (mini-batches every 1-5 minutes).

I found a couple of tools for data extraction from the MySQL binlog (e.g. http://maxwells-daemon.io) with a connector to AWS Kinesis. I plan to transform and join the data with Apache Spark or AWS Lambda, write it to S3, and from there load it into Redshift with the COPY command.

Does anybody have any suggestions or recommendations regarding this or a similar solution? Thanks in advance!

asked Oct 29 '22 by Luka Krstev

1 Answer

It really depends on what transformations you want to do on the data coming from MySQL, and which language and framework are most suitable in your environment.

Below are my suggestions:

  1. Add intermediary storage between MySQL and Redshift, for example S3
  2. Consider a data re-loading mechanism in case a data load to Redshift fails

I would consider AWS Data Pipeline for this because it has ready-to-use templates, retry mechanisms and built-in logging and monitoring.

It could look like this:

MySQL -> S3 -> Redshift (staging) -> SQL -> Redshift (production)

MySQL -> S3

This data pipeline would start with the "Incremental copy of RDS MySQL to S3" template.

You can parameterize the S3 path in which to store the incremental data from MySQL, so it is easier to manage those increments over time using S3 prefixes. If the data on S3 is never going to be used again, you can periodically delete those objects with S3 Object Lifecycle Management.

However, having the data on S3 provides several other benefits: you can query it with AWS Athena, visualize it with QuickSight, or archive it with the Infrequent Access or Glacier storage classes to reduce storage costs while keeping it available for future use.
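As a rough illustration of the Athena option, assuming the increments land as CSV under a made-up prefix such as s3://my-etl-bucket/mysql/orders/ (the bucket, database, table, and columns below are my assumptions, not from your setup):

    -- Hypothetical external table over the incremental MySQL extracts on S3
    CREATE EXTERNAL TABLE IF NOT EXISTS mysql_exports.orders (
        order_id    bigint,
        customer_id bigint,
        amount      decimal(10,2),
        updated_at  timestamp
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 's3://my-etl-bucket/mysql/orders/';

    -- Ad-hoc sanity check on the extracts without touching Redshift
    SELECT count(*), max(updated_at) FROM mysql_exports.orders;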

S3 -> Redshift (staging)

Of course, for your use case the data must end up in Redshift, so I recommend the AWS Redshift Loader Lambda. Its setup is a bit complex, but once completed it is, as they claim, very close to zero administration. See this.

With the AWS Redshift Loader, every time data arrives at a defined S3 prefix, it is loaded into your Redshift cluster (one or more), and you can configure SNS to notify you or some logging system about it. There are other options too, such as waiting for x files before loading, or loading every y minutes.

Also, you might have a requirement to load only some of the data from S3 into a testing or development environment. With the AWS Redshift Loader you can define a specific S3 prefix, e.g. /data/2017/11, to load into Redshift.
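Under the hood, each of these loads is essentially a COPY from the S3 prefix into your cluster. A minimal sketch of such a command, assuming a hypothetical staging table, bucket, and IAM role (none of these names come from your environment):

    -- Load one increment from S3 into the staging table
    COPY staging.orders
    FROM 's3://my-etl-bucket/mysql/orders/2017/11/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS CSV
    TIMEFORMAT 'auto';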

By using the Redshift Loader you are making the data load asynchronous, and as such it is a bit harder to monitor and control that process. Whether this is an issue depends on your case.

If a data load to Redshift fails, you can use the Redshift Loader command-line tooling to reload specific objects into Redshift.

Redshift (staging) -> Redshift (production)

Note that Redshift does not enforce constraints such as primary or unique keys, which means you would need a mechanism to prevent inserting duplicate rows into Redshift tables. If you don't worry about duplicates, this is irrelevant: your data is already in Redshift.
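For example, a quick check for duplicates that have slipped in (assuming a hypothetical analytics.orders table with order_id as the intended unique key):

    -- List key values that appear more than once
    SELECT order_id, count(*) AS copies
    FROM analytics.orders
    GROUP BY order_id
    HAVING count(*) > 1;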

Why does it matter? Because if data retrieved from the source (MySQL) is already in Redshift, you have to know what action to take: do you overwrite it, or just ignore it?

Doing it in Redshift itself makes it easy to compare new and existing rows using a unique column and either delete-and-insert or just update. Doing it outside Redshift would probably mean keeping track, outside the cluster, of the unique keys already loaded and doing the comparison there. Where? When do you update them? How? Maybe you already have a solution for that.

So, in the previous step you insert data into Redshift (staging). To ensure data integrity when moving it to the production tables, you have to merge. AWS documents several recommended techniques for merging data in Redshift.
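One of those techniques is a delete-and-insert merge inside a transaction. A minimal sketch, assuming hypothetical staging.orders and analytics.orders tables with order_id as the intended unique key:

    BEGIN TRANSACTION;

    -- Remove production rows that the new increment will replace
    DELETE FROM analytics.orders
    USING staging.orders
    WHERE analytics.orders.order_id = staging.orders.order_id;

    -- Insert the whole increment from staging
    INSERT INTO analytics.orders
    SELECT * FROM staging.orders;

    END TRANSACTION;

    -- Clear the staging table for the next mini-batch
    TRUNCATE staging.orders;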

If that's the case, why not do the transformations in SQL as well, so there are fewer components to manage?

You can define your transformation jobs in SQL, store those SQL scripts on S3, and reference them in a SQLActivity for execution on your Redshift cluster alongside the data-merging scripts.
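Such a transformation script could be as simple as the following sketch (the orders/customers tables and columns are made-up examples, not from your schema):

    -- Example transformation kept as a .sql file on S3 and referenced by a SQLActivity:
    -- denormalize staged orders with staged customers for reporting.
    INSERT INTO staging.orders_enriched
    SELECT o.order_id,
           o.amount,
           o.updated_at,
           c.customer_id,
           c.country
    FROM staging.orders o
    JOIN staging.customers c
      ON o.customer_id = c.customer_id;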

AWS Kinesis might also be suitable, given its transformation and S3 storage capabilities. Some of the points raised above apply to using Kinesis as well.

answered Nov 15 '22 by Pawel