
AWS Redshift column limit?

I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns increases in a table.

What doesn't make any sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that aren't referenced in a query's SELECT list or WHERE clause.

More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than when it has, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.

SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'

My three questions are:

  1. What's the deal? Why does Redshift have this limitation if it claims to be a column store?
  2. Any suggestions for working around this limitation? Joining multiple smaller tables seems to eventually approximate the performance of a single wide table. I haven't tried pivoting the data.
  3. Does anyone have a suggestion for a fast, horizontally scalable column-store database with real-time performance that doesn't have the above limitations? All we're doing is count queries with simple WHERE restrictions against approximately 10M rows x 2500 columns of data.
asked Sep 03 '15 by mellocello

People also ask

Does Redshift have a limit?

Amazon Redshift has quotas that limit the use of several resources in your AWS account per AWS Region. There is a default value for each quota, and some quotas are adjustable. For adjustable quotas, you can request an increase for your AWS account in an AWS Region by submitting an Amazon Redshift Limit Increase Form.

How much data can Redshift handle?

Redshift allows up to 16 petabytes of data on a cluster compared to Amazon RDS Aurora's maximum size of 128 terabytes. Amazon Redshift is based on an older version of PostgreSQL 8.0.

How do you increase the column length in Redshift?

No, you can't increase the column size in Redshift without recreating the table. But if it's acceptable for the column to end up as the last column in the table, you can add a new column with the required size, move the data over, and then drop the old column, as sketched below.
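A minimal sketch of that workaround, assuming a hypothetical table my_table whose VARCHAR(256) column note needs to grow to VARCHAR(512) (the names and sizes are illustrative, not from the original answer):

-- Add a wider column, copy the data across, then drop the old column and rename
ALTER TABLE my_table ADD COLUMN note_new VARCHAR(512);
UPDATE my_table SET note_new = note;
ALTER TABLE my_table DROP COLUMN note;
ALTER TABLE my_table RENAME COLUMN note_new TO note;

The renamed column ends up in the last position of the table, and the UPDATE rewrites every row, so running VACUUM afterwards is generally advisable.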

What is the max VARCHAR size in Redshift?

You can create an Amazon Redshift table with a TEXT column, but it is converted to a VARCHAR(256) column that accepts variable-length values with a maximum of 256 characters.
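A quick way to see that conversion (a sketch; the table and column names are made up):

-- TEXT is accepted in the DDL but stored as VARCHAR(256)
CREATE TABLE notes (body TEXT);
-- Inspect the resulting column type
SELECT "column", type FROM pg_table_def WHERE tablename = 'notes';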


1 Answer

I can't explain precisely why it slows down so much, but I can verify that we've experienced the same thing.

I think part of the issue is that Redshift stores a minimum of one 1MB block per column per slice. Having a lot of columns creates a lot of disk seek activity and I/O overhead (a rough illustration of the footprint follows the list below).

  • 1MB blocks are problematic because most of each block will be empty space, but it will still be read off the disk
  • Having lots of blocks means that column data will not be located as close together, so Redshift has to do a lot more work to find it.
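As a rough illustration of that footprint (the slice count is an assumed example, not a figure from this answer): with a minimum of one 1MB block per column per slice, a 1600-column table on a cluster with 32 slices starts at roughly 1600 x 32 x 1MB ≈ 50GB of allocated blocks before the rows themselves amount to much.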

Also (this just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. It tries to ensure you get a consistent read while your query is executing, and presumably that requires making a note of all the blocks for the tables in your query, even blocks for columns that are not used. See: Why is an implicit table lock being released prior to end of transaction in Redshift?

FWIW, our columns were virtually all BOOLEAN and we've had very good results from compacting them (bit masking) into INT/BIGINTs and accessing the values with bit-wise functions. One example table went from 1400 columns (~200GB) to ~60 columns (~25GB), and query times improved more than 10x (from 30-40 seconds down to 1-2 seconds).
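A sketch of the bit-packing idea with made-up table and column names (the original answer doesn't show its schema; the answer mentions bit-wise functions, and this example uses the integer bit-wise & operator):

-- Pack boolean flags into one BIGINT: bit 0 = is_active, bit 1 = is_verified, bit 2 = is_premium
CREATE TABLE users_packed (
    user_id BIGINT,
    flags   BIGINT
);

-- Count rows where the is_verified bit (value 2) is set
SELECT COUNT(1)
FROM users_packed
WHERE (flags & 2) > 0;

Each BIGINT can hold up to 64 flags, which is how a 1400-column table of BOOLEANs can collapse into a few dozen columns.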

answered Oct 04 '22 by Joe Harris