Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Amazon Redshift Equality filter performance and sortkeys

Does Redshift efficiently (i.e. binary search) find a block of a table that is sorted on a column A for a query with a condition A=?

As an example, let there be a table T with ~500m rows, ~50 fields, distributed and sorted on field A. Field A has high cardinality - so there are ~4.5 m different A values, with exactly the same number of rows in T: ~100 rows per value.
Assume a redshift cluster with a single XL node.
Field A is not compressed. All other fields have some form compression, as suggested by ANALYZE COMPRESSION. A ratio of 1:20 was given compared to an uncompressed table.

Given a trivial query:

select avg(B),avg(C) from
(select B,C from T where A = <val>)

After VACUUM and ANALYZE the following explain plan is given:

XN Aggregate (cost=1.73..1.73 rows=1 width=8)
-> XN Seq Scan on T (cost=0.00..1.23 rows=99 width=8)
Filter: (A = <val>::numeric)

This query takes 39 seconds to complete.
The main question is: Is this the expected behavior of redshift?

According to the documentation at Choosing the best sortkey:
"If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. Redshift can skip reading entire blocks of data for that column because it keeps track of the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range."

In Choosing sort keys:
"Another optimization that depends on sorted data is the efficient handling of range-restricted predicates. Amazon Redshift stores columnar data in 1 MB disk blocks. The min and max values for each block are stored as part of the metadata. If a range-restricted column is a sort key, the query processor is able to use the min and max values to rapidly skip over large numbers of blocks during table scans. For example, if a table stores five years of data sorted by date and a query specifies a date range of one month, up to 98% of the disk blocks can be eliminated from the scan. If the data is not sorted, more of the disk blocks (possibly all of them) have to be scanned. For more information about these optimizations, see Choosing distribution keys."

Secondary questions:
What is the complexity of the aforementioned skipping scan on a sort key? Is it linear ( O(n) ) or some variant of binary search ( O(logn) )?
If a key is sorted - is skipping the only optimization available?
What would this "skipping" optimization look like in the explain plan?
Is the above explain the best one possible for this query?
What is the fastest result redshift can be expected to provide given this scenario?
Does vanilla ParAccel have different behavior in this use case?

like image 664
user2886358 Avatar asked Nov 02 '22 12:11

user2886358


1 Answers

This question is answered on amazon forum: https://forums.aws.amazon.com/thread.jspa?threadID=137610

like image 135
diemacht Avatar answered Nov 11 '22 11:11

diemacht