Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Slow performance using NOT IN

Tags:

sql

postgresql

I am trying to make the query to get each year with a count of the number of movies in that year that had casts that all were not male (for each year, count the number of movies in that year that had no males).

These are the tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
CASTS (pid, mid, role)  -- pid refers to actor id, mid refers to movie id

This is what I indexed (id for the tables are primary keys, so they're already indexed, or so I assume):

CREATE INDEX gender_index on actor(gender);
CREATE INDEX movie_name_index on movie(name);
CREATE INDEX movie_year_index on movie(year);
CREATE INDEX casts_index on casts(pid, mid, role);
CREATE INDEX casts_pid_index on casts(pid);
CREATE INDEX casts_mid_index on casts(mid);
CREATE INDEX casts_role_index on casts(role);

This is my query:

SELECT m.year, count(m.id)
FROM movie as m
WHERE m.id NOT IN (
    SELECT DISTINCT m.id
    FROM movie as m, casts as c, actor as a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year

The query takes forever (and never completes), so how can I make this fast? Does using NOT EXISTS help, although I thought the optimizer takes care of this? Do I need to index anything else? Is there another query that is better? I'm using PostgreSQL if this makes any difference.

Here is the EXPLAIN:

"GroupAggregate  (cost=1512539.61..171886457832.52 rows=61 width=8)"
"  Group Key: m.year"
"  ->  Index Scan using movie_year_index on movie m  (cost=1512539.61..171886453988.38 rows=768706 width=8)"
"        Filter: (NOT (SubPlan 1))"
"        SubPlan 1"
"          ->  Materialize  (cost=1512539.18..1732298.66 rows=1537411 width=4)"
"                ->  Unique  (cost=1512539.18..1718605.60 rows=1537411 width=4)"
"                      ->  Merge Join  (cost=1512539.18..1700559.32 rows=7218511 width=4)"
"                            Merge Cond: (m_1.id = c.mid)"
"                            ->  Index Only Scan using movie_pkey on movie m_1  (cost=0.43..57863.94 rows=1537411 width=4)"
"                            ->  Materialize  (cost=1512531.37..1548623.92 rows=7218511 width=4)"
"                                  ->  Sort  (cost=1512531.37..1530577.65 rows=7218511 width=4)"
"                                        Sort Key: c.mid"
"                                        ->  Hash Join  (cost=54546.59..492838.95 rows=7218511 width=4)"
"                                              Hash Cond: (c.pid = a.id)"
"                                              ->  Seq Scan on casts c  (cost=0.00..186246.43 rows=11445843 width=8)"
"                                              ->  Hash  (cost=35248.91..35248.91 rows=1176214 width=4)"
"                                                    ->  Seq Scan on actor a  (cost=0.00..35248.91 rows=1176214 width=4)"
"                                                          Filter: ((gender)::text = 'M'::text)"
like image 504
Jack Avatar asked Jan 04 '23 10:01

Jack


2 Answers

I would try

SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (
    SELECT NULL
    FROM casts c, actor a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year
like image 78
Luc M Avatar answered Jan 08 '23 04:01

Luc M


I would first drop the distinct. I've run into many instances where a distinct in a subquery creates awful query plans. Also, I would maybe try using a left outer join with a filter for null instead of a not exists. Something like

SELECT m_fems.year, count(m_fems.id)
FROM movie as m_fems
    left outer join movie m_males
        inner join casts c on c.mid = m_males.mid
        inner join actor a on a.gender = 'M' and a.cid = c.cid
    on m_males.mid = m_fems.mid
WHERE m_males.mid is null
GROUP BY m_fems.year
ORDER BY m_fems.year

The potentially many-to-one left outer join won't interfere with your count because any matches are filtered out by m_males.mid is null.

Also, apologies if this isn't exactly proper PostgreSQL. I usually deal with SQL Server / T-SQL

like image 45
ashja99 Avatar answered Jan 08 '23 04:01

ashja99