I have a ~2TB fully vacuumed Redshift table with a distkey phash
(high cardinality, hundreds of millions of values) and compound sortkeys (phash, last_seen)
.
When I do a query like:
SELECT
DISTINCT ret_field
FROM
table
WHERE
phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae'
)
AND
last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
It returns very quickly. However when I increase the number of hashes beyond 10, Redshift converts the IN condition from a bunch of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists
The problem is when I have a couple dozen phash
values, the "optimized" query goes from less than a second response time to over half an hour. In other words it stops using the sortkey and does a full table scan.
Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?
Here is the EXPLAIN
difference between <10 hashes and >10 hashes:
Less than 10 (0.4 seconds):
XN Unique (cost=0.00..157253450.20 rows=43 width=27)
-> XN Seq Scan on table (cost=0.00..157253393.92 rows=22510 width=27)
Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))
More than 10 (45-60 minutes):
XN Unique (cost=0.00..181985241.25 rows=1717530 width=27)
-> XN Seq Scan on table (cost=0.00..179718164.48 rows=906830708 width=27)
Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))
You can try to create temporary table/subquery:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (
SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
UNION ALL
SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
UNION ALL
SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
-- UNION ALL
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
Alternatively do searching in chunks (if query optimizer merge it to one, use auxiliary table to store intermediate results):
SELECT ret_field
FROM table
WHERE phash IN (
'5c8615fa967576019f846b55f11b6e41',
'8719c8caa9740bec10f914fc2434ccfd',
'9b657c9f6bf7c5bbd04b5baf94e61dae')
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( ) -- more hashes)
AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
-- ...
If query optimizer merge it to one you can try to use temp table for intermediate results
EDIT:
SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
FROM ...
) AS sub
ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
It's worth a try to set sortkeys (last_seen, phash)
, putting last_seen
first.
The reason of slowness might be because the leading column for the sort key is phash
which looks like a random character.
As AWS redshift dev docs says, the timestamp columns should be as the leading column for the sort key if using that for where conditions.
If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift
With this order of the sort key, all columns will be sorted by last_seen
, then phash
. (What does it mean to have multiple sortkey columns?)
One note is that you have to recreate your table to change the sort key. This will help you to do that.
Do you really need DISTINCT
? This operator could be expensive.
I'd try to use LATERAL JOIN
. In the query below the table Hashes
has a column phash
- this is your big batch of hashes. It could be a temp table, a (sub)query, anything.
SELECT DISTINCT T.ret_field
FROM
Hashes
INNER JOIN LATERAL
(
SELECT table.ret_field
FROM table
WHERE
table.phash = Hashes.phash
AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
) AS T ON true
It is quite likely that optimizer implements LATERAL JOIN
as a nested loop. It would loop through all rows in Hashes
and for each row run the SELECT FROM table
. The inner SELECT
should use index that you have on (phash, last_seen)
. To play it safe include ret_field
into the index as well to make it a covering index: (phash, last_seen, ret_field)
.
There is a very valid point in the answer by @Diego: instead of putting constant phash
values into the query, put them in a temporary or permanent table.
I'd like to extend the answer by @Diego and add that it is important that this table with hashes has index, unique index.
So, create a table Hashes
with one column phash
that has exactly the same type as in your main table.phash
. It is important that types match. Make that column a primary key with unique clustered index. Dump your dozens of phash
values into the Hashes
table.
Then the query becomes a simple INNER JOIN
, not lateral:
SELECT DISTINCT T.ret_field
FROM
Hashes
INNER JOIN table ON table.phash = Hashes.phash
WHERE
table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
It is still important that table
has index on (phash, last_seen, ret_field)
.
Optimizer should be able to take advantage of the fact that both joined tables are sorted by phash
column and that it is unique in the Hashes
table.
you can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.
Here's an example (I'm using a CTE because with the tool Im using is hard to capture the plan when you have more than one SQL statement - but go with a temp table if you can)
select *
from <my_table>
where checksum in
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
VERSUS
with foo as
(
select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select *
from <my_table> r
join foo f on r.checksum = F.my_key
and here's the plan, as you can see it looks more complex but that's because of the CTE, it wouldn't look that ways on a temp table:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With