I've been doing some load testing of AWS Redshift for a new application, and I noticed that it has a column limit of 1600 per table. Worse, queries slow down as the number of columns in a table increases.
What doesn't make sense here is that Redshift is supposed to be a column-store database, so in theory there shouldn't be an I/O hit from columns that a particular query never references.
More specifically, when TableName has 1600 columns, I found that the query below is substantially slower than when TableName has, say, 1000 columns and the same number of rows. As the number of columns decreases, performance improves.
SELECT COUNT(1) FROM TableName
WHERE ColumnName LIKE '%foo%'
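For reference, the comparison looked like this (hypothetical table names; the two tables held identical rows and differed only in column count):

SELECT COUNT(1) FROM TableName_1600cols WHERE ColumnName LIKE '%foo%'; -- slow
SELECT COUNT(1) FROM TableName_1000cols WHERE ColumnName LIKE '%foo%'; -- noticeably faster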
My question is: why does a column store slow down as the column count grows at all, and is there a practical way to work around it?
I can't explain precisely why it slows down so much, but I can confirm that we've experienced the same thing.
I think part of the issue is that Redshift stores a minimum of one 1 MB block per column per slice. Having a lot of columns therefore means a lot of blocks to visit, which creates a lot of disk seek activity and I/O overhead.
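To make that concrete: if the one-block-per-column-per-slice minimum holds, a 1600-column table on a cluster with, say, 32 slices occupies at least 1600 × 32 × 1 MB ≈ 50 GB before it holds a single row. You can check the per-column block counts yourself against the STV_BLOCKLIST and STV_TBL_PERM system tables; a rough sketch, where 'tablename' is a placeholder:

-- Count the 1 MB blocks allocated to each column of one table.
-- Join on both table id and slice, since STV_TBL_PERM has one row per slice.
SELECT bl.col, COUNT(*) AS mb_blocks
FROM stv_blocklist bl
JOIN stv_tbl_perm tp
  ON bl.tbl = tp.id AND bl.slice = tp.slice
WHERE TRIM(tp.name) = 'tablename'
GROUP BY bl.col
ORDER BY bl.col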
Also (this just occurred to me), I suspect that Redshift's MVCC controls add a lot of overhead. To give you a consistent read while your query is executing, it presumably has to make a note of all the blocks for the tables in your query, even blocks for columns that are not used. See also: Why is an implicit table lock being released prior to end of transaction in RedShift?
FWIW, our columns were virtually all BOOLEAN, and we've had very good results from compacting them (bit masking) into INTs/BIGINTs and accessing the values with the bit-wise functions. One example table went from 1400 columns (~200 GB) to ~60 columns (~25 GB), and the query times improved more than 10x (from 30-40 seconds down to 1-2 seconds).
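A minimal sketch of the packing idea, assuming hypothetical table/column names and Redshift's PostgreSQL-style bit-wise & operator: each BIGINT carries 64 of the old BOOLEAN flags, with flag N stored in bit N, so a test that used to be WHERE flag5 becomes a mask test.

-- 64 former BOOLEAN columns packed into one BIGINT; bit N holds flag N.
CREATE TABLE events_packed (
    event_id BIGINT,
    flags0 BIGINT -- bits 0-63 replace what were 64 BOOLEAN columns
);

-- "WHERE flag5 = true" becomes a bit test; 32 = 2^5, i.e. bit 5.
SELECT COUNT(1)
FROM events_packed
WHERE (flags0 & 32) <> 0

The win comes from Redshift scanning one packed column's blocks instead of the blocks of dozens of separate boolean columns.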