
Poor performance on Amazon Redshift queries based on VARCHAR size

I'm building an Amazon Redshift data warehouse, and experiencing unexpected performance impacts based on the defined size of the VARCHAR column. Details are as follows. Three of my columns are shown from pg_table_def:

 schemaname | tablename |     column      |            type             | encoding  | distkey | sortkey | notnull 
------------+-----------+-----------------+-----------------------------+-----------+---------+---------+---------
 public     | logs      | log_timestamp   | timestamp without time zone | delta32k  | f       |       1 | t
 public     | logs      | event           | character varying(256)      | lzo       | f       |       0 | f
 public     | logs      | message         | character varying(65535)    | lzo       | f       |       0 | f
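
For reference, the column definitions above come straight from the system catalog view, queried roughly like this:

select * from pg_table_def where tablename = 'logs';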

I've recently run Vacuum and Analyze, I have about 100 million rows in the database, and I'm seeing very different performance depending on which columns I include.

Query 1: For instance, the following query takes about 3 seconds:

select log_timestamp from logs order by log_timestamp desc limit 5;

Query 2: A similar query asking for more data runs in 8 seconds:

select log_timestamp, event from logs order by log_timestamp desc limit 5;

Query 3: However, this query, very similar to the previous, takes 8 minutes to run!

select log_timestamp, message from logs order by log_timestamp desc limit 5;

Query 4: Finally, this query, identical to the slow one but with explicit range limits, is very fast (~3s):

select log_timestamp, message from logs where log_timestamp > '2014-06-18' order by log_timestamp desc limit 5;

The message column is defined to be able to hold larger messages, but in practice it doesn't hold much data: the average length of the message field is 16 characters (std_dev 10). The average length of the event field is 5 characters (std_dev 2). The only distinction I can really see is the max length of the VARCHAR field, but I wouldn't think that should have an order-of-magnitude effect on the time a simple query takes to return!
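
For what it's worth, those length stats came from a quick aggregate along these lines:

select avg(len(message)) as avg_message_len,
       stddev_samp(len(message)) as sd_message_len,
       avg(len(event)) as avg_event_len,
       stddev_samp(len(event)) as sd_event_len
from logs;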

Any insight would be appreciated. While this isn't the typical use case for this tool (we'll be aggregating far more than we'll be inspecting individual logs), I'd like to understand any subtle or not-so-subtle effects of my table design.

Thanks!

Dave

asked Jun 19 '14 by DaveA


People also ask

Does VARCHAR length matter in Redshift?

Redshift compresses column data, so over-sized VARCHAR columns should not affect table size.

What is the max size for a VARCHAR in Redshift?

You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters.
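
For example (table and column names here are just illustrative):

create table text_demo (note text);
select "column", type from pg_table_def where tablename = 'text_demo';
-- type is reported as character varying(256)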

Why is Redshift query so slow?

Dataset size – A higher volume of data in the cluster can slow query performance for queries, because more rows need to be scanned and redistributed. You can mitigate this effect by regular vacuuming and archiving of data, and by using a predicate to restrict the query dataset.
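
For the logs table in this question, that routine maintenance would simply be:

vacuum logs;
analyze logs;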

Which of the below Redshift features helps in query performance improvement?

Amazon Redshift is optimized to reduce your storage footprint and improve query performance by using compression encodings. When you don't use compression, data consumes additional space and requires additional disk I/O. Applying compression to large uncompressed columns can have a big impact on your cluster.
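
For instance, Redshift can recommend encodings from a sample of the existing data, or encodings can be declared explicitly at table-creation time (a sketch mirroring the columns above; the table name is illustrative):

-- ask Redshift to recommend encodings based on a sample of the table's data
analyze compression logs;

-- or declare encodings explicitly when creating a table
create table logs_encoded (
    log_timestamp timestamp encode delta32k,
    event         varchar(256) encode lzo,
    message       varchar(65535) encode lzo
);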


2 Answers

Redshift is a "true columnar" database and only reads the columns that are specified in your query. So, when you specify 2 small columns, only those 2 columns have to be read. However, when you add in the 3rd, large column, the work that Redshift has to do increases dramatically.

This is very different from a "row store" database (SQL Server, MySQL, Postgres, etc.) where the entire row is stored together. In a row store adding/removing query columns does not make much difference in response time because the database has to read the whole row anyway.
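
One way to see this directly is to compare the bytes scanned per step for each of your test queries (a sketch; the query id is a placeholder you'd look up in stl_query first):

select query, seg, step, rows, bytes, label
from svl_query_summary
where query = 123456   -- replace with the id of your query from stl_query
order by seg, step;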

Finally, the reason your last query is very fast is that you've told Redshift it can skip a large portion of the data. Redshift stores each column in "blocks", and these blocks are sorted according to the sort key you specified. Redshift keeps a record of the min/max of each block and can skip over any blocks that could not contain data to be returned.
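
You can inspect that per-block min/max metadata yourself if you're curious (a rough sketch; minvalue/maxvalue are shown in Redshift's internal integer representation):

select b.col, b.blocknum, b.minvalue, b.maxvalue, b.num_values
from stv_blocklist b
join stv_tbl_perm p on b.tbl = p.id
where p.name = 'logs'
order by b.col, b.blocknum
limit 20;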

The limit clause doesn't reduce the work that has to be done, because you've told Redshift that it must first order all rows by log_timestamp descending. The problem is that the ORDER BY … DESC has to be executed over the entire potential result set before any data can be returned or discarded. When the columns are small that's fast; when they're big, it's slow.
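
If you want to confirm it, running explain on the slow query and the range-restricted one (both taken from the question) shows the full-table sort in the first case and the timestamp filter applied at the scan in the second:

explain
select log_timestamp, message from logs
order by log_timestamp desc limit 5;

explain
select log_timestamp, message from logs
where log_timestamp > '2014-06-18'
order by log_timestamp desc limit 5;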

answered by Joe Harris


Out of curiosity, how long does this take?

select log_timestamp, message
from logs l join
     -- find the 5th most recent timestamp using only the narrow sort-key column
     (select min(log_timestamp) as log_timestamp
      from (select log_timestamp
            from logs
            order by log_timestamp desc
            limit 5
           ) lt
     ) lt
     -- then pull the wide message column only for rows at or after that cutoff
     on l.log_timestamp >= lt.log_timestamp;
answered by Gordon Linoff