Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading data (incrementally) into Amazon Redshift, S3 vs DynamoDB vs Insert

Tags:

I have a web app that needs to send reports on its usage, I want to use Amazon RedShift as a data warehouse for that purpose, How should i collect the data ?

Every time, the user interact with my app, i want to report that.. so when should i write the files to S3 ? and how many ? What i mean is: - If do not send the info immediately, then I might lose it as a result of a connection lost, or from some bug in my system while its been collected and get ready to be sent to S3... - If i do write files to S3 on each user interaction, i will end up with hundreds of files (on each file has minimal data), that need to be managed, sorted, deleted after been copied to RedShift.. that dose not seems like a good solution .

What am i missing? Should i use DynamoDB instead, Should i use simple insert into Redshift instead !?
If i do need to write the data to DynamoDB, should i delete the hold table after been copied .. what are the best practices ?

On any case what are the best practices to avoid data duplication in RedShift ?

Appreciate the help!

like image 942
Ofer Velich Avatar asked Jan 11 '14 23:01

Ofer Velich


People also ask

What is the most efficient and fastest way to load data into Redshift?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.

Is S3 faster than DynamoDB?

For relatively small items, especially those with a size of less than 4 KB, DynamoDB runs individual operations faster than Amazon S3. DynamoDB can scale on-demand, but S3 offers better scalability.

Which of the following is the recommended method for loading large volumes of data into Redshift?

Method 1: Loading Data to Redshift using the Copy Command The Redshift COPY command is the standard way of loading bulk data TO Redshift.

Is it possible to load data from Amazon DynamoDB into Amazon Redshift?

Amazon Redshift complements Amazon DynamoDB with advanced business intelligence capabilities and a powerful SQL-based interface. When you copy data from a DynamoDB table into Amazon Redshift, you can perform complex data analysis queries on that data, including joins with other tables in your Amazon Redshift cluster.


1 Answers

It is preferred to aggregate event logs before ingesting them into Amazon Redshift.

The benefits are:

  • You will use the parallel nature of Redshift better; COPY on a set of larger files in S3 (or from a large DynamoDB table) will be much faster than individual INSERT or COPY of a small file.

  • You can pre-sort your data (especially if the sorting is based on event time) before loading it into Redshift. This is also improve your load performance and reduce the need for VACUUM of your tables.

You can accumulate your events in several places before aggregating and loading them into Redshift:

  • Local file to S3 - the most common way is to aggregate your logs on the client/server and every x MB or y minutes upload them to S3. There are many log appenders that are supporting this functionality, and you don't need to make any modifications in the code (for example, FluentD or Log4J). This can be done with container configuration only. The down side is that you risk losing some logs and these local log files can be deleted before the upload.

  • DynamoDB - as @Swami described, DynamoDB is a very good way to accumulate the events.

  • Amazon Kinesis - the recently released service is also a good way to stream your events from the various clients and servers to a central location in a fast and reliable way. The events are in order of insertion, which makes it easy to load it later pre-sorted to Redshift. The events are stored in Kinesis for 24 hours, and you can schedule the reading from kinesis and loading to Redshift every hour, for example, for better performance.

Please note that all these services (S3, SQS, DynamoDB and Kinesis) allow you to push the events directly from the end users/devices, without the need to go through a middle web server. This can significantly improve the high availability of your service (how to handle increased load or server failure) and the cost of the system (you only pay for what you use and you don't need to have underutilized servers just for logs).

See for example how you can get temporary security tokens for mobile devices here: http://aws.amazon.com/articles/4611615499399490

Another important set of tools to allow direct interaction with these services are the various SDKs. For example for Java, .NET, JavaScript, iOS and Android.

Regarding the de-duplication requirement; in most of the options above you can do that in the aggregation phase, for example, when you are reading from a Kinesis stream, you can check that you don't have duplications in your events, but analysing a large buffer of events before putting into the data store.

However, you can do this check in Redshift as well. A good practice is to COPY the data into a staging tables and then SELECT INTO a well organized and sorted table.

Another best practice you can implement is to have a daily (or weekly) table partition. Even if you would like to have one big long events table, but the majority of your queries are running on a single day (the last day, for example), you can create a set of tables with similar structure (events_01012014, events_01022014, events_01032014...). Then you can SELECT INTO ... WHERE date = ... to each of this tables. When you want to query the data from multiple days, you can use UNION_ALL.

like image 161
Guy Avatar answered Oct 22 '22 19:10

Guy