
Poor performance on Amazon Redshift queries based on VARCHAR size

I'm building an Amazon Redshift data warehouse, and experiencing unexpected performance impacts based on the defined size of the VARCHAR column. Details are as follows. Three of my columns are shown from pg_table_def:

 schemaname | tablename |     column      |            type             | encoding  | distkey | sortkey | notnull 
------------+-----------+-----------------+-----------------------------+-----------+---------+---------+---------
 public     | logs      | log_timestamp   | timestamp without time zone | delta32k  | f       |       1 | t
 public     | logs      | event           | character varying(256)      | lzo       | f       |       0 | f
 public     | logs      | message         | character varying(65535)    | lzo       | f       |       0 | f
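
For reference, the column definitions above come straight from the system catalog view, queried roughly like this:

select * from pg_table_def where tablename = 'logs';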

I've recently run Vacuum and Analyze, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.

Query 1: For instance, the following query takes about 3 seconds:

select log_timestamp from logs order by log_timestamp desc limit 5;

Query 2: A similar query asking for more data runs in 8 seconds:

select log_timestamp, event from logs order by log_timestamp desc limit 5;

Query 3: However, this query, very similar to the previous, takes 8 minutes to run!

select log_timestamp, message from logs order by log_timestamp desc limit 5;

Query 4: Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):

select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;

The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
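
For what it's worth, those length stats came from a quick aggregate along these lines:

select avg(len(message)) as avg_message_len,
       stddev_samp(len(message)) as sd_message_len,
       avg(len(event)) as avg_event_len,
       stddev_samp(len(event)) as sd_event_len
from logs;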

Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.

Thanks!

Dave

asked Jun 19 '14 by DaveA


People also ask

Does VARCHAR length matter in Redshift?

Redshift compresses column data, so over-sized VARCHAR columns should not affect table size.

What is the max size for a VARCHAR in Redshift?

You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters.
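
For example (table and column names here are just illustrative):

create table text_demo (note text);
select "column", type from pg_table_def where tablename = 'text_demo';
-- type is reported as character varying(256)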

Why is Redshift query so slow?

Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate to restrict the query dataset.
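
For the logs table in this question, that routine maintenance would simply be:

vacuum logs;
analyze logs;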

Which of the below Redshift features helps in query performance improvement?

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using compression encodings. When you don't use compression, data consumes additional space and requires additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your cluster.
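
For instance, Redshift can recommend encodings from a sample of the existing data, or encodings can be declared explicitly at table-creation time (a sketch mirroring the columns above; the table name is illustrative):

-- ask Redshift to recommend encodings based on a sample of the table's data
analyze compression logs;

-- or declare encodings explicitly when creating a table
create table logs_encoded (
    log_timestamp timestamp encode delta32k,
    event         varchar(256) encode lzo,
    message       varchar(65535) encode lzo
);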


2 Answers

Redshift is a "true columnar" database and only reads the columns that are specified in your query. So, when you specify 2 small columns, only those 2 columns have to be read. However, when you add in the 3rd, large column, the work that Redshift has to do increases dramatically.

This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.) where the entire row is stored together. In a row store adding/removing query columns does not make much difference in response time because the database has to read the whole row anyway.
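
One way to see this directly is to compare the bytes scanned per step for each of your test queries (a sketch; the query id is a placeholder you'd look up in stl_query first):

select query, seg, step, rows, bytes, label
from svl_query_summary
where query = 123456   -- replace with the id of your query from stl_query
order by seg, step;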

Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
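
You can inspect that per-block min/max metadata yourself if you're curious (a rough sketch; minvalue/maxvalue are shown in Redshift's internal integer representation):

select b.col, b.blocknum, b.minvalue, b.maxvalue, b.num_values
from stv_blocklist b
join stv_tbl_perm p on b.tbl = p.id
where p.name = 'logs'
order by b.col, b.blocknum
limit 20;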

The limit clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order all rows by log_timestamp descending. The problem is that the ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big, it's slow.
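
If you want to confirm it, running explain on the slow query and the range-restricted one (both taken from the question) shows the full-table sort in the first case and the timestamp filter applied at the scan in the second:

explain
select log_timestamp, message from logs
order by log_timestamp desc limit 5;

explain
select log_timestamp, message from logs
where log_timestamp > '2014-06-18'
order by log_timestamp desc limit 5;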

answered by Joe Harris


Out of curiosity, how long does this take?

select log_timestamp, message
from logs l join
     -- find the 5th most recent timestamp using only the narrow sort-key column
     (select min(log_timestamp) as log_timestamp
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) lt
     ) lt
     -- then pull the wide message column only for rows at or after that cutoff
     on l.log_timestamp >= lt.log_timestamp;
answered by Gordon Linoff