Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize large IN condition for Redshift query

I have a ~2TB fully vacuumed Redshift table with a distkey phash (high cardinality, hundreds of millions of values) and compound sortkeys (phash, last_seen).

When I do a query like:

SELECT
    DISTINCT ret_field
FROM
    table
WHERE
    phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae'
    )
AND
    last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It returns very quickly. However when I increase the number of hashes beyond 10, Redshift converts the IN condition from a bunch of ORs to an array, per http://docs.aws.amazon.com/redshift/latest/dg/r_in_condition.html#r_in_condition-optimization-for-large-in-lists

The problem is when I have a couple dozen phash values, the "optimized" query goes from less than a second response time to over half an hour. In other words it stops using the sortkey and does a full table scan.

Any idea how I can prevent this behavior and retain the use of sortkeys to keep the query quick?

Here is the EXPLAIN difference between <10 hashes and >10 hashes:

Less than 10 (0.4 seconds):

XN Unique  (cost=0.00..157253450.20 rows=43 width=27)
    ->  XN Seq Scan on table  (cost=0.00..157253393.92 rows=22510 width=27)
                Filter: ((((phash)::text = '394e9a527f93377912cbdcf6789787f1'::text) OR ((phash)::text = '4534f9f8f68cc937f66b50760790c795'::text) OR ((phash)::text = '5c8615fa967576019f846b55f11b6e61'::text) OR ((phash)::text = '5d5743a86b5ff3d60b133c6475e7dce0'::text) OR ((phash)::text = '8719c8caa9740bec10f914fc2434cced'::text) OR ((phash)::text = '9b657c9f6bf7c5bbd04b5baf94e61d9e'::text) OR ((phash)::text = 'd7337d324be519abf6dbfd3612aad0c0'::text) OR ((phash)::text = 'ea43b04ac2f84710dd1f775efcd5ab40'::text)) AND (last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone))

More than 10 (45-60 minutes):

XN Unique  (cost=0.00..181985241.25 rows=1717530 width=27)
    ->  XN Seq Scan on table  (cost=0.00..179718164.48 rows=906830708 width=27)
                Filter: ((last_seen >= '2015-10-01 00:00:00'::timestamp without time zone) AND (last_seen <= '2015-10-31 23:59:59'::timestamp without time zone) AND ((phash)::text = ANY ('{33b84c5775b6862df965a0e00478840e,394e9a527f93377912cbdcf6789787f1,3d27b96948b6905ffae503d48d75f3d1,4534f9f8f68cc937f66b50760790c795,5a63cd6686f7c7ed07a614e245da60c2,5c8615fa967576019f846b55f11b6e61,5d5743a86b5ff3d60b133c6475e7dce0,8719c8caa9740bec10f914fc2434cced,9b657c9f6bf7c5bbd04b5baf94e61d9e,d7337d324be519abf6dbfd3612aad0c0,dbf4c743832c72e9c8c3cc3b17bfae5f,ea43b04ac2f84710dd1f775efcd5ab40,fb4b83121cad6d23e6da6c7b14d2724c}'::text[])))
like image 852
Harry Avatar asked Nov 17 '15 18:11

Harry


4 Answers

You can try to create temporary table/subquery:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (
   SELECT '5c8615fa967576019f846b55f11b6e41' AS phash
   UNION ALL 
   SELECT '8719c8caa9740bec10f914fc2434ccfd' AS phash
   UNION ALL
   SELECT '9b657c9f6bf7c5bbd04b5baf94e61dae' AS phash
   -- UNION ALL
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';

Alternatively do searching in chunks (if query optimizer merge it to one, use auxiliary table to store intermediate results):

SELECT ret_field
FROM table
WHERE phash IN (
        '5c8615fa967576019f846b55f11b6e41',
        '8719c8caa9740bec10f914fc2434ccfd',
        '9b657c9f6bf7c5bbd04b5baf94e61dae')
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION
SELECT ret_field
FROM table
WHERE phash IN ( ) -- more hashes)
  AND last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
UNION 
-- ...

If query optimizer merge it to one you can try to use temp table for intermediate results

EDIT:

SELECT DISTINCT t.ret_field
FROM table t
JOIN (SELECT ... AS phash
      FROM ...
) AS sub
   ON t.phash = sub.phash
WHERE t.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59';
like image 147
Lukasz Szozda Avatar answered Nov 06 '22 04:11

Lukasz Szozda


It's worth a try to set sortkeys (last_seen, phash), putting last_seen first.

The reason of slowness might be because the leading column for the sort key is phash which looks like a random character. As AWS redshift dev docs says, the timestamp columns should be as the leading column for the sort key if using that for where conditions.

If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. - Choose the Best Sort Key - Amazon Redshift

With this order of the sort key, all columns will be sorted by last_seen, then phash. (What does it mean to have multiple sortkey columns?)

One note is that you have to recreate your table to change the sort key. This will help you to do that.

like image 28
Masashi M Avatar answered Nov 06 '22 04:11

Masashi M


Do you really need DISTINCT ? This operator could be expensive.

I'd try to use LATERAL JOIN. In the query below the table Hashes has a column phash - this is your big batch of hashes. It could be a temp table, a (sub)query, anything.

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN LATERAL
    (
        SELECT table.ret_field
        FROM table
        WHERE
            table.phash = Hashes.phash
            AND table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'
    ) AS T ON true

It is quite likely that optimizer implements LATERAL JOIN as a nested loop. It would loop through all rows in Hashes and for each row run the SELECT FROM table. The inner SELECT should use index that you have on (phash, last_seen). To play it safe include ret_field into the index as well to make it a covering index: (phash, last_seen, ret_field).


There is a very valid point in the answer by @Diego: instead of putting constant phash values into the query, put them in a temporary or permanent table.

I'd like to extend the answer by @Diego and add that it is important that this table with hashes has index, unique index.

So, create a table Hashes with one column phash that has exactly the same type as in your main table.phash. It is important that types match. Make that column a primary key with unique clustered index. Dump your dozens of phash values into the Hashes table.

Then the query becomes a simple INNER JOIN, not lateral:

SELECT DISTINCT T.ret_field
FROM
    Hashes
    INNER JOIN table ON table.phash = Hashes.phash
WHERE
    table.last_seen BETWEEN '2015-10-01 00:00:00' AND '2015-10-31 23:59:59'

It is still important that table has index on (phash, last_seen, ret_field).

Optimizer should be able to take advantage of the fact that both joined tables are sorted by phash column and that it is unique in the Hashes table.

like image 28
Vladimir Baranov Avatar answered Nov 06 '22 05:11

Vladimir Baranov


you can get rid of the "ORs" by inserting the data you want into a temp table and joining it with your actual table.

Here's an example (I'm using a CTE because with the tool Im using is hard to capture the plan when you have more than one SQL statement - but go with a temp table if you can)

select * 
from <my_table>
where checksum in 
(
'd7360f1b600ae9e895e8b38262cee47936fb6ced',
'd1606f795152c73558513909cd59a8bc3ad865a8',
'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84',
'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)

VERSUS

with foo as
(
    select 'd7360f1b600ae9e895e8b38262cee47936fb6ced' as my_key union
    select 'd1606f795152c73558513909cd59a8bc3ad865a8' union
    select 'bb3f6bb3d1a98d35a0f952a53d738ddec5c72c84' union
    select 'b2cad5a92575ed3868ac6e405647c2213eea74a5'
)
select  * 
from <my_table> r 
     join foo f on r.checksum = F.my_key

and here's the plan, as you can see it looks more complex but that's because of the CTE, it wouldn't look that ways on a temp table:

enter image description here

like image 38
Diego Avatar answered Nov 06 '22 04:11

Diego