
Redshift SELECT * performance versus COUNT(*) for a nonexistent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:

SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
  Execution time: 36.75s

versus

SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
  Execution time: 0.2s

Given that the table is sorted by project_id and then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 sec. That's a huge difference for the same result, and I don't understand why.

If it helps, the schema is as follows:

CREATE TABLE profile (
    project_id integer not null,
    id varchar(256) not null,
    created timestamp not null,
    /* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);

EXPLAIN output for SELECT COUNT(*) ...:

XN Aggregate  (cost=435.70..435.70 rows=1 width=0)
  ->  XN Seq Scan on profile  (cost=0.00..435.70 rows=1 width=0)
        Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))

EXPLAIN output for SELECT * ...:

XN Seq Scan on profile  (cost=0.00..435.70 rows=1 width=7356)
    Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))

Why is the non-count query so much slower? Surely Redshift knows the row doesn't exist?

AndySavage asked Mar 05 '26 05:03


1 Answer

The reason is that in many RDBMSs a COUNT(*) question can usually be answered without scanning the actual data, using only an index or table statistics. Redshift stores the minimum and maximum value of every block, and it uses them to answer exists/not-exists questions such as the one described here. If the requested value lies outside a block's min/max boundaries, the answer comes straight from the stored statistics, which is much faster. If the requested value lies inside the boundaries, a scan is performed, but for COUNT(*) only on the data of the filtering columns. In the case of SELECT *, Redshift actually scans the data of all columns, as the "*" in the query asks, while still filtering only by the columns in the WHERE clause.
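To make the block-level reasoning concrete, here is a small Python toy model of per-block min/max pruning in a column store. It is only a sketch of the idea described above, not Redshift's actual implementation: the names Block, make_column, candidate_blocks and blocks_read are hypothetical, and the cost model (one block read per selected column for every block that the min/max check cannot rule out) is an assumption taken from this answer.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """One storage block of a single column (hypothetical toy structure)."""
    values: list

    @property
    def lo(self):
        # Minimum value stored in this block (the block's "min" metadata).
        return min(self.values)

    @property
    def hi(self):
        # Maximum value stored in this block (the block's "max" metadata).
        return max(self.values)


def make_column(values, block_size):
    """Split a column's values into fixed-size blocks."""
    return [Block(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


def candidate_blocks(column, target):
    """Indices of blocks whose [min, max] range could contain target."""
    return [i for i, b in enumerate(column) if b.lo <= target <= b.hi]


def blocks_read(table, filter_col, target, select_all):
    """Assumed cost model: every block that cannot be ruled out by min/max
    is read once for the filter column only (COUNT(*)) or once for every
    column in the table (SELECT *)."""
    hits = candidate_blocks(table[filter_col], target)
    columns_touched = len(table) if select_all else 1
    return len(hits) * columns_touched


# A sorted id column holding only even numbers, plus 50 payload columns.
ids = list(range(0, 200, 2))
table = {"id": make_column(ids, 10)}
for n in range(50):
    table[f"col_{n}"] = make_column([0] * len(ids), 10)

# 51 is absent but falls inside one block's range (40..58).
count_cost = blocks_read(table, "id", 51, select_all=False)
star_cost = blocks_read(table, "id", 51, select_all=True)

# 1000 lies outside every block's range, so nothing needs to be read.
outside_cost = blocks_read(table, "id", 1000, select_all=True)

print(count_cost, star_cost, outside_cost)
```

In this toy model the absent value 51 costs one block read for the count but 51 block reads for the full-row query, while a value outside every block's range costs nothing in either case. Whether Redshift behaves exactly this way for a given table depends on details that the model leaves out.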

Yuri Levinsky answered Mar 07 '26 01:03