
AWS Redshift JDBC insert performance


I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.

I am struggling to get anything like the performance some others claim (for example, here).

I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader), and the machine doing the load is an EC2 m1.xlarge instance in the same VPC as the Redshift cluster, running 64-bit Ubuntu 12.04.1.

I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the PostgreSQL 9.2-1002 driver (principally because it's the only one in Maven Central, which makes my build easier!).

I've tried all the techniques shown here, except the last one.

I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.
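For reference, the kind of multi-row insert I am batching looks roughly like this (a simplified sketch; the clicks table, its columns, and the row format are placeholders for my real schema):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class MultiRowInsert {
        // Build one INSERT ... VALUES (?,?), (?,?), ... statement so the
        // whole batch is a single statement and a single commit.
        public static void insertBatch(Connection conn, List<String[]> rows) throws Exception {
            StringBuilder sql = new StringBuilder("INSERT INTO clicks (user_id, url) VALUES ");
            for (int i = 0; i < rows.size(); i++) {
                sql.append(i == 0 ? "(?, ?)" : ", (?, ?)");
            }
            try (PreparedStatement ps = conn.prepareStatement(sql.toString())) {
                int p = 1;
                for (String[] row : rows) {
                    ps.setString(p++, row[0]);
                    ps.setString(p++, row[1]);
                }
                ps.executeUpdate();
            }
        }
    }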

Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:

2013-05-10 15:05:06,937 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done

What am I doing wrong? What other approaches could I take?

asked May 10 '13 by dty


People also ask

Why are Redshift inserts slow?

The reason single inserts are slow is the way Redshift handles commits. Redshift has a single queue for commits. Say you insert row 1, then commit - it goes to the Redshift commit queue to finish the commit. Next you insert row 2, then commit - it again goes to the commit queue.

What is the quickest and most efficient way to load a large amount of on premises data to AWS Redshift cluster?

A COPY command is the most efficient way to load a table. You can also add data to your tables using INSERT commands, though it is much less efficient than using COPY. The COPY command is able to read from multiple data files or multiple data streams simultaneously.
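As a rough sketch, a COPY can be issued over an ordinary JDBC connection; the table name, S3 path, and IAM role below are illustrative placeholders, not values from the question:

    import java.sql.Connection;
    import java.sql.Statement;

    public class CopyLoad {
        // Issue a COPY over JDBC; Redshift pulls the files from S3 itself,
        // so the connection carries only the command, not the data.
        public static void copyFromS3(Connection conn) throws Exception {
            String copy = "COPY clicks FROM 's3://my-bucket/clicks/' "
                        + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy' "
                        + "FORMAT AS CSV GZIP";
            try (Statement st = conn.createStatement()) {
                st.execute(copy);
            }
        }
    }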

What is the most effective way to merge data into an existing table in Redshift?

You can efficiently update and insert new data by loading your data into a staging table first. Amazon Redshift doesn't support a single merge statement (update or insert, also known as an upsert) to insert and update data from a single data source.
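A hedged sketch of that staging pattern over JDBC (all table, column, bucket, and role names are illustrative placeholders):

    import java.sql.Connection;
    import java.sql.Statement;

    public class StagingUpsert {
        // Load into a staging table, then delete matching rows from the
        // target and insert the staged rows, all inside one transaction.
        public static void upsert(Connection conn) throws Exception {
            conn.setAutoCommit(false);
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TEMP TABLE clicks_stage (LIKE clicks)");
                st.execute("COPY clicks_stage FROM 's3://my-bucket/clicks/' "
                         + "IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy' CSV");
                st.execute("DELETE FROM clicks USING clicks_stage "
                         + "WHERE clicks.event_id = clicks_stage.event_id");
                st.execute("INSERT INTO clicks SELECT * FROM clicks_stage");
                conn.commit();
            } catch (Exception e) {
                conn.rollback();
                throw e;
            }
        }
    }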

How can I improve the performance of my Amazon Redshift database?

Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO SELECT or CREATE TABLE AS to improve performance. For more information about using the COPY command to load tables, see Loading data.
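For example, a set-based CREATE TABLE AS can rebuild a table in one statement (the clicks and clicks_dedup tables are placeholders):

    import java.sql.Connection;
    import java.sql.Statement;

    public class CtasExample {
        // One set-based statement in place of many single-row INSERTs.
        public static void dedup(Connection conn) throws Exception {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE clicks_dedup AS "
                         + "SELECT DISTINCT * FROM clicks");
            }
        }
    }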

What is the best JDBC driver for Amazon Redshift?

Because Amazon Redshift is based on PostgreSQL, we previously recommended using JDBC4 PostgreSQL driver version 8.4.703 and psql ODBC version 9.x drivers. If you’re currently using those drivers, we recommend moving to the new Amazon Redshift–specific drivers.

How do I write effective data retrieval queries in Amazon Redshift?

Once your system is set up, you typically work with DML the most, especially the SELECT command for retrieving and viewing data. To write effective data retrieval queries in Amazon Redshift, become familiar with SELECT and apply the tips outlined in Amazon Redshift best practices for designing tables to maximize query efficiency.

What is federated query in Amazon Redshift?

The new Federated Query feature in Amazon Redshift allows you to run analytics directly against live data residing on your OLTP source system databases and Amazon S3 data lake, without the overhead of performing ETL and ingesting source data into Amazon Redshift tables.


2 Answers

Redshift (aka ParAccel) is an analytic database. The goal is to enable analytic queries to be answered quickly over very large volumes of data. To that end Redshift stores data in a columnar format. Each column is held separately and compressed against the previous values in the column. This compression tends to be very effective because a given column usually holds many repeated or similar values.

This storage approach provides many benefits at query time because only the requested columns need to be read, and the data being read is highly compressed. However, the cost is that inserts tend to be slower and require much more effort. Also, inserts that are not perfectly ordered may result in poor query performance until the tables are VACUUMed.

So, by inserting a single row at a time you are working completely against the way that Redshift operates. The database has to append your data to each column in succession and calculate the compression. It's a little bit (but not exactly) like adding a single value to a large number of zip archives. Additionally, even after your data is inserted you still won't get optimal performance until you run VACUUM to reorganise the tables.

If you want to analyse your data in "real time" then, for all practical purposes, you should probably choose another database and/or approach. Off the top of my head, here are three:

  1. Accept a "small" batching window (5-15 minutes) and plan to run VACUUM at least daily.
  2. Choose an analytic database (more $) which copes with small inserts, e.g., Vertica.
  3. Experiment with "NoSQL" DBs that allow single path analysis, e.g., Acunu Cassandra.
answered Oct 11 '22 by Joe Harris


The reason single inserts are slow is the way Redshift handles commits. Redshift has a single queue for commits.

Say you insert row 1, then commit - it goes to the Redshift commit queue to finish the commit.

Next you insert row 2, then commit - it again goes to the commit queue. If the commit of row 1 has not completed by then, row 2 waits for it to finish before its own commit begins.

So if you batch your inserts, Redshift performs a single commit, which is much faster than issuing a commit per row.
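A minimal JDBC sketch of that idea (table and column names are placeholders): turn off auto-commit so the driver stops committing after every statement, batch the inserts, then commit once.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class BatchedCommit {
        // One commit for the whole batch means one pass through Redshift's
        // commit queue instead of one per row.
        public static void writeBatch(Connection conn, String[][] rows) throws Exception {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO clicks (user_id, url) VALUES (?, ?)")) {
                for (String[] row : rows) {
                    ps.setString(1, row[0]);
                    ps.setString(2, row[1]);
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit(); // single commit for the whole batch
            }
        }
    }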

You can find commit queue information under Tip #9 (Maintaining efficient data loads) in the link below: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

answered Oct 11 '22 by scorpio155