Amazon redshift: bulk insert vs COPYing from s3

Tags: amazon-s3, amazon-redshift

I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a clicks table. Let's say I have ~10 new 'clicks' to store each second. I would like the data to be available in Redshift as soon as possible.

From what I understand, insert performance is bad because of the columnar storage, so you have to insert in batches. My workflow is to store the clicks in Redis and, every minute, insert the ~600 clicks from Redis into Redshift as a batch.

I have two ways of inserting a batch of clicks into Redshift:

Multi-row insert strategy: I use a regular insert query to insert multiple rows. Multi-row insert documentation here
S3 Copy strategy: I upload the rows to S3 as clicks_1408736038.csv. Then I run a COPY to load this file into the clicks table. COPY documentation here
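To make the multi-row insert strategy concrete, here is a minimal sketch of how one batch could be turned into a single INSERT statement. The function name `build_multirow_insert` and the columns `user_id` and `url` are hypothetical, since the question does not show the clicks schema. It uses `%s` placeholders, which is the parameter style of DB-API drivers such as psycopg2; the table and column names are interpolated directly, so they must come from trusted code, never from user input.

```python
from typing import Any, List, Sequence, Tuple


def build_multirow_insert(
    table: str, columns: Sequence[str], rows: Sequence[Sequence[Any]]
) -> Tuple[str, List[Any]]:
    """Return (sql, params) for one INSERT statement carrying every row."""
    if not rows:
        raise ValueError("rows must not be empty")
    width = len(columns)
    if any(len(row) != width for row in rows):
        raise ValueError("every row must have one value per column")

    # One "(%s, %s, ...)" group per row, all in a single VALUES clause.
    row_placeholder = "(" + ", ".join(["%s"] * width) + ")"
    values_clause = ", ".join([row_placeholder] * len(rows))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES {values_clause}"

    # Flatten the rows so the values line up with the placeholders.
    params = [value for row in rows for value in row]
    return sql, params


# Hypothetical batch of two clicks.
sql, params = build_multirow_insert(
    "clicks", ["user_id", "url"], [(1, "/a"), (2, "/b")]
)
print(sql)
print(params)
```

The resulting pair can be passed to a cursor as `cursor.execute(sql, params)`, so the whole batch travels in one round trip.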

I've done some tests (on a clicks table that already had 2 million rows):

             | multi-row insert strategy |       S3 Copy strategy    |
             |---------------------------+---------------------------+
             |       insert query        | upload to s3 | COPY query |
-------------+---------------------------+--------------+------------+
1 record     |           0.25s           |     0.20s    |   0.50s    |
1k records   |           0.30s           |     0.20s    |   0.50s    |
10k records  |           1.90s           |     1.29s    |   0.70s    |
100k records |           9.10s           |     7.70s    |   1.50s    |

As you can see, in terms of performance, it looks like I gain nothing by first copying the data to S3. For the 10k and 100k batches, the upload + COPY time is roughly equal to the insert time; for the smaller batches, the insert is faster.
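The comparison can be checked with a few lines of arithmetic. The sketch below only re-adds the timings reported in the table above; the dictionary layout and the helper name `s3_total` are my own.

```python
# Timings in seconds, copied from the test results above.
timings = {
    "1 record":     {"insert": 0.25, "upload": 0.20, "copy": 0.50},
    "1k records":   {"insert": 0.30, "upload": 0.20, "copy": 0.50},
    "10k records":  {"insert": 1.90, "upload": 1.29, "copy": 0.70},
    "100k records": {"insert": 9.10, "upload": 7.70, "copy": 1.50},
}


def s3_total(t: dict) -> float:
    """Total time of the S3 strategy: upload to S3 plus the COPY query."""
    return t["upload"] + t["copy"]


for label, t in timings.items():
    total = s3_total(t)
    upload_share = t["upload"] / total
    print(
        f"{label:<13} insert={t['insert']:.2f}s  "
        f"upload+COPY={total:.2f}s  upload share={upload_share:.0%}"
    )
```

Printing the upload share makes it visible that, for the larger batches, most of the S3 strategy's time goes into the upload rather than into the COPY itself.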

Questions:

What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?

And a side question: is it possible for Redshift to COPY the data automatically from S3 via a manifest? I mean COPYing the data as soon as new .csv files are added to S3? Doc here and here. Or do I have to create a background worker myself to trigger the COPY commands?
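If the background worker route is taken, its job can be sketched as two small steps: write a manifest listing the new files, then issue a COPY that points at it. The sketch below only builds the two strings. The bucket name, the manifest key and the IAM role ARN are placeholders, and the helper names are hypothetical; `IAM_ROLE` is one of the authorization options COPY accepts, so substitute whatever your cluster uses.

```python
import json
from typing import Iterable


def build_manifest(bucket: str, keys: Iterable[str]) -> str:
    """Return the JSON body of a COPY manifest listing the given S3 objects."""
    entries = [
        # "mandatory": True makes COPY fail if a listed file is missing.
        {"url": f"s3://{bucket}/{key}", "mandatory": True}
        for key in keys
    ]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table: str, manifest_url: str, iam_role: str) -> str:
    """Return a COPY statement that loads the CSV files named in the manifest."""
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV MANIFEST"
    )


# Placeholder names, for illustration only.
print(build_manifest("my-bucket", ["clicks_1408736038.csv"]))
print(
    build_copy_sql(
        "clicks",
        "s3://my-bucket/manifests/clicks.manifest",
        "arn:aws:iam::123456789012:role/MyRedshiftRole",
    )
)
```

The worker would upload the manifest to S3 and then run the statement against the cluster. Listing each file explicitly in a manifest also means a given COPY loads exactly the files you intended.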

My quick analysis:

In the documentation about consistency, there is no mention of loading the data via multi-row inserts. It looks like the preferred way is COPYing from S3 with unique object keys (each .csv on S3 has its own unique name)...

S3 Copy strategy:
- PROS: looks like the good practice from the docs.
- CONS: More work (I have to manage buckets and manifests and a cron that triggers the COPY commands...)
Multi-row insert strategy:
- PROS: Less work. I can call an insert query from my application code
- CONS: doesn't look like a standard way of importing data. Am I missing something?

277

asked Aug 22 '14 19:08

Benjamin Crouzier

1 Answer

Redshift is an analytical database, optimized for querying millions and billions of records. It is also optimized to ingest these records very quickly using the COPY command.

The COPY command is designed to load multiple files in parallel into the multiple nodes of the cluster. For example, if you have a cluster of 5 small nodes (dw2.xl), you can copy data 10 times faster if your data is split across multiple files (20, for example). There is a balance between the number of files and the number of records in each file, as each file carries a small overhead.
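As a rough illustration of the splitting step, the sketch below divides one batch of rows into a given number of CSV parts of nearly equal size, so that each part can be uploaded as its own file. The function name and the sample rows are hypothetical, and the right number of parts depends on your cluster; 20 is simply the figure used in the example above.

```python
import csv
import io
from typing import Any, List, Sequence


def split_into_csv_parts(rows: Sequence[Sequence[Any]], parts: int) -> List[str]:
    """Split rows into at most `parts` CSV documents of nearly equal size."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    if not rows:
        return []

    # Never create more parts than there are rows, to avoid empty files.
    parts = min(parts, len(rows))
    base, extra = divmod(len(rows), parts)

    chunks = []
    start = 0
    for i in range(parts):
        # The first `extra` parts each take one additional row.
        size = base + (1 if i < extra else 0)
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows[start:start + size])
        chunks.append(buffer.getvalue())
        start += size
    return chunks


# Hypothetical batch: 600 clicks split into 20 files of 30 rows each.
batch = [(i, f"/page/{i}") for i in range(600)]
files = split_into_csv_parts(batch, 20)
print(len(files), "files,", files[0].count("\n"), "rows in the first one")
```

Each string would then be uploaded under its own unique key and loaded with a single COPY, for example through a manifest or a common key prefix.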

This should lead you to a balance between the frequency of the COPY (for example, every 5 or 15 minutes rather than every 30 seconds) and the size and number of the event files.

Another point to consider is the two types of Redshift nodes: the SSD ones (dw2.xl and dw2.8xl) and the magnetic ones (dw1.xl and dw1.8xl). The SSD ones are also faster at ingestion. Since you are looking for very fresh data, you will probably prefer the SSD ones, which usually cost less for under 500GB of compressed data. If over time you have more than 500GB of compressed data, you can consider running 2 different clusters: one for "hot" data on SSD with the data of the last week or month, and one for "cold" data on magnetic disks with all your historical data.

Lastly, you don't really need to upload the data to S3, which is the major part of your ingestion time. You can copy the data directly from your servers using the SSH COPY option. See more information about it here: http://docs.aws.amazon.com/redshift/latest/dg/loading-data-from-remote-hosts.html
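A minimal sketch of what the SSH option involves: COPY reads a small manifest that names each host and the command to run on it, then loads whatever the command prints. Note that the manifest itself is still stored in S3; only the data bypasses it. The host names, user name, file path and helper name below are placeholders, and the setup steps (such as authorizing the cluster's public key on each host) are covered in the linked documentation.

```python
import json
from typing import Iterable


def build_ssh_manifest(hosts: Iterable[str], command: str, username: str) -> str:
    """Return the JSON body of an SSH manifest: one entry per remote host."""
    entries = [
        {
            "endpoint": host,      # host name or IP the cluster connects to
            "command": command,    # its standard output is what COPY loads
            "mandatory": True,     # fail the COPY if this host is unreachable
            "username": username,  # account used for the SSH login
        }
        for host in hosts
    ]
    return json.dumps({"entries": entries}, indent=2)


# Placeholder hosts and path, for illustration only.
print(
    build_ssh_manifest(
        ["clicks-1.example.com", "clicks-2.example.com"],
        "cat /var/log/clicks/pending.csv",
        "loader",
    )
)
```

The COPY statement then points at the manifest's S3 location and ends with the `SSH` keyword instead of `MANIFEST`. With one entry per host, the cluster can read from several servers in parallel.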

If you are able to split your Redis queues across multiple servers, or at least into multiple queues with different log files, you can probably get a very good records-per-second ingestion speed.

Another pattern to consider for near-real-time analytics is Amazon Kinesis, the streaming service. It lets you run analytics on data with a delay of seconds and, at the same time, prepare the data to be copied into Redshift in a more optimized way.

149

answered Oct 10 '22 22:10

Guy
