How to do Real-time loading into Amazon Redshift?

We are evaluating Amazon Redshift for real-time data warehousing.

Data will be streamed and processed through a Java service and then stored in the database. We process row by row (in real time), so we will insert only one row per transaction.

What is the best practice for real-time data loading into Amazon Redshift?

Should we use JDBC and perform INSERT INTO statements, or try Kinesis Firehose, or perhaps AWS Lambda?
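
For context, the row-by-row JDBC path we are considering would look roughly like this; the cluster URL, credentials, table, and columns below are placeholders, not our actual setup:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class SingleRowInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster URL, credentials, table, and columns.
        String url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO events (event_id, payload, created_at) VALUES (?, ?, ?)")) {
            ps.setLong(1, 42L);
            ps.setString(2, "{\"example\": true}");
            ps.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
            ps.executeUpdate(); // one row per transaction, as described above
        }
    }
}
```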

I'm concerned about using either of the latter two services, because both use Amazon S3 as a middle layer and run the COPY command, which is suited to bigger data sets, not to "one-row" inserts.

asked Jan 14 '17 by fenix

2 Answers

It is not efficient to use individual INSERT statements with Amazon Redshift. It is designed as a data warehouse, providing very fast SQL queries over large volumes of data; it is not a transaction-processing database where rows are frequently inserted and updated.

The best practice is to load data in batches (or micro-batches) via the COPY command, which is what Kinesis Firehose does. COPY is much more efficient because multiple nodes load the data in parallel.
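
As a rough illustration (not a drop-in recipe), a micro-batch load issued over JDBC might look like this, assuming the batch has already been staged as files in S3; the table name, S3 prefix, and IAM role are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MicroBatchCopy {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster URL, credentials, table, S3 prefix, and IAM role.
        String url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // COPY fans the files under the S3 prefix out across the cluster's
            // slices and loads them in parallel.
            stmt.execute(
                "COPY events "
              + "FROM 's3://example-bucket/events/batch-2017-01-14/' "
              + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole' "
              + "FORMAT AS JSON 'auto'");
        }
    }
}
```

Splitting the batch into multiple files under one S3 prefix is what lets the nodes load in parallel, which is why COPY scales so much better than row-at-a-time INSERTs.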

If you are seriously looking at processing data in real time, then Amazon Redshift might not be the best database to use. Consider a traditional SQL database (e.g. one provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB), or even Elasticsearch. Choose Redshift only if your focus is reporting across large volumes of data, typically involving many table joins.

As mentioned in the Amazon Redshift Best Practices for Loading Data:

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
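
For example, a multi-row VALUES clause sends several rows in a single statement (a sketch only; the cluster URL, table, and values are placeholders):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MultiRowInsert {
    public static void main(String[] args) throws Exception {
        // Placeholder cluster URL, credentials, table, and values.
        String url = "jdbc:redshift://example-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement()) {
            // One INSERT carrying several rows, per the guideline quoted above.
            stmt.executeUpdate(
                "INSERT INTO events (event_id, payload) VALUES "
              + "(1, '{\"a\": 1}'), (2, '{\"b\": 2}'), (3, '{\"c\": 3}')");
        }
    }
}
```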

answered Sep 18 '22 by John Rotenstein

The best option is Kinesis Firehose, which works on batches of events. You write events into Firehose one by one, and it batches them in an optimal way based on your configuration: you can define how many minutes' worth of events to batch, or the size of the batch in MB. You might be able to get a single event into Redshift faster with INSERT, but that method is not scalable. COPY is designed to work at almost any scale.
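
A minimal sketch of the write side, using the AWS SDK for Java (v1) and assuming a Firehose delivery stream named example-redshift-stream is already configured to deliver into your Redshift cluster:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehose;
import com.amazonaws.services.kinesisfirehose.AmazonKinesisFirehoseClientBuilder;
import com.amazonaws.services.kinesisfirehose.model.PutRecordRequest;
import com.amazonaws.services.kinesisfirehose.model.Record;

public class FirehoseWriter {
    public static void main(String[] args) {
        AmazonKinesisFirehose firehose = AmazonKinesisFirehoseClientBuilder.defaultClient();

        // One event per call; Firehose buffers records server-side until the
        // configured time or size threshold is reached, then stages the batch
        // to S3 and issues a single COPY into Redshift.
        String event = "{\"event_id\": 42, \"payload\": \"example\"}\n";
        firehose.putRecord(new PutRecordRequest()
                .withDeliveryStreamName("example-redshift-stream") // placeholder name
                .withRecord(new Record().withData(
                        ByteBuffer.wrap(event.getBytes(StandardCharsets.UTF_8)))));
    }
}
```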

answered Sep 21 '22 by Guy