 

Copying only new records from AWS DynamoDB to AWS Redshift

I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking for an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to truncate the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?

asked Jan 07 '14 by Gowtham

People also ask

Is it possible to load data from Amazon DynamoDB into Amazon Redshift?

Amazon Redshift complements Amazon DynamoDB with advanced business intelligence capabilities and a powerful SQL-based interface. When you copy data from a DynamoDB table into Amazon Redshift, you can perform complex data analysis queries on that data, including joins with other tables in your Amazon Redshift cluster.

How do I export data from AWS DynamoDB?

To export a DynamoDB table, you use the AWS Data Pipeline console to create a new pipeline. The pipeline launches an Amazon EMR cluster to perform the actual export. Amazon EMR reads the data from DynamoDB, and writes the data to an export file in an Amazon S3 bucket.

What is the most efficient and fastest way to load data into redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
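To make the COPY approach concrete, here is a minimal sketch of building the Redshift COPY statement that reads directly from a DynamoDB table. The table names and IAM role ARN are placeholders, and READRATIO caps how much of the DynamoDB table's provisioned read throughput the COPY may consume:

```python
def build_copy_sql(redshift_table, dynamodb_table, iam_role, readratio=50):
    """Build a Redshift COPY statement that loads from a DynamoDB table.

    All names and the IAM role ARN are placeholders for illustration.
    READRATIO limits the share of the DynamoDB table's provisioned
    read capacity that the COPY operation is allowed to use.
    """
    return (
        f"COPY {redshift_table} "
        f"FROM 'dynamodb://{dynamodb_table}' "
        f"IAM_ROLE '{iam_role}' "
        f"READRATIO {readratio};"
    )

# Example (hypothetical table and role names):
sql = build_copy_sql(
    "events", "events_dynamo",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```

You would run the generated statement through your usual Redshift SQL client. Note that, as the answers below discuss, this style of COPY always reads the whole DynamoDB table, which is exactly why it does not solve the incremental case on its own.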


2 Answers

DynamoDB has a feature (currently in preview) called Streams:

Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.

This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.

You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
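A rough sketch of that processing loop with boto3: the helper below pulls the post-change item out of each INSERT record in a GetRecords response, and `poll_stream` (defined but not called here, since it needs AWS credentials and a stream-enabled table) shows one way to fetch a batch of records. The table name is a placeholder, and a real consumer would iterate over all shards and page through iterators rather than reading just the first shard:

```python
def extract_new_items(records):
    """Return the NewImage (post-change item) for each INSERT record
    in a DynamoDB Streams GetRecords response."""
    return [
        r["dynamodb"]["NewImage"]
        for r in records
        if r["eventName"] == "INSERT"
    ]


def poll_stream(table_name):
    """Sketch: read one batch of records from the table's stream.

    Assumes Streams is enabled on the table and AWS credentials are
    configured; reads only the first shard for brevity.
    """
    import boto3  # imported lazily; requires AWS access to actually run

    streams = boto3.client("dynamodbstreams")
    stream_arn = streams.list_streams(TableName=table_name)["Streams"][0]["StreamArn"]
    shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    return streams.get_records(ShardIterator=iterator)["Records"]
```

The INSERT-only filter is what gives you the "only new rows" behavior the question asks for: you would stage the extracted items into S3 and COPY them into Redshift, instead of re-exporting the whole table.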

answered Sep 28 '22 by mkobit


The Redshift COPY from DynamoDB can only copy the entire table. There are several ways to achieve an incremental load:

  1. Using an AWS EMR cluster and Hive - if you set up an EMR cluster, you can use Hive tables to run queries against the DynamoDB data and move the results to S3. That data can then be easily loaded into Redshift.

  2. You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, for example one table per time period, then each DynamoDB table can be dropped after it has been copied to Redshift.
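The time-series pattern in option 2 hinges on a predictable per-period table naming convention, so the daily job knows which table to copy and then drop. A minimal sketch, assuming a hypothetical `prefix_YYYY_MM_DD` naming scheme:

```python
from datetime import date, timedelta


def daily_table_name(prefix, day):
    """DynamoDB table name for one day's data under the time-series
    pattern. The 'prefix_YYYY_MM_DD' convention is an assumption for
    illustration, not an AWS requirement."""
    return f"{prefix}_{day.strftime('%Y_%m_%d')}"


def yesterdays_table(prefix, today=None):
    """Name of the table the daily job should COPY to Redshift and
    then drop (yesterday's data is complete; today's is still being
    written)."""
    today = today or date.today()
    return daily_table_name(prefix, today - timedelta(days=1))
```

A daily job would COPY `yesterdays_table("events")` into Redshift, verify the row counts, and only then delete that DynamoDB table, so each day's data is loaded exactly once.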

answered Sep 28 '22 by Gowtham