Slow performance using NOT IN

Question

I am trying to make the query to get each year with a count of the number of movies in that year that had casts that all were not male (for each year, count the number of movies in that year that had no males).

These are the tables:

ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
CASTS (pid, mid, role)  -- pid refers to actor id, mid refers to movie id

This is what I indexed (id for the tables are primary keys, so they're already indexed, or so I assume):

CREATE INDEX gender_index on actor(gender);
CREATE INDEX movie_name_index on movie(name);
CREATE INDEX movie_year_index on movie(year);
CREATE INDEX casts_index on casts(pid, mid, role);
CREATE INDEX casts_pid_index on casts(pid);
CREATE INDEX casts_mid_index on casts(mid);
CREATE INDEX casts_role_index on casts(role);

This is my query:

SELECT m.year, count(m.id)
FROM movie as m
WHERE m.id NOT IN (
    SELECT DISTINCT m.id
    FROM movie as m, casts as c, actor as a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year

The query takes forever (and never completes), so how can I make this fast? Does using NOT EXISTS help, although I thought the optimizer takes care of this? Do I need to index anything else? Is there another query that is better? I'm using PostgreSQL if this makes any difference.

Here is the EXPLAIN:

"GroupAggregate  (cost=1512539.61..171886457832.52 rows=61 width=8)"
"  Group Key: m.year"
"  ->  Index Scan using movie_year_index on movie m  (cost=1512539.61..171886453988.38 rows=768706 width=8)"
"        Filter: (NOT (SubPlan 1))"
"        SubPlan 1"
"          ->  Materialize  (cost=1512539.18..1732298.66 rows=1537411 width=4)"
"                ->  Unique  (cost=1512539.18..1718605.60 rows=1537411 width=4)"
"                      ->  Merge Join  (cost=1512539.18..1700559.32 rows=7218511 width=4)"
"                            Merge Cond: (m_1.id = c.mid)"
"                            ->  Index Only Scan using movie_pkey on movie m_1  (cost=0.43..57863.94 rows=1537411 width=4)"
"                            ->  Materialize  (cost=1512531.37..1548623.92 rows=7218511 width=4)"
"                                  ->  Sort  (cost=1512531.37..1530577.65 rows=7218511 width=4)"
"                                        Sort Key: c.mid"
"                                        ->  Hash Join  (cost=54546.59..492838.95 rows=7218511 width=4)"
"                                              Hash Cond: (c.pid = a.id)"
"                                              ->  Seq Scan on casts c  (cost=0.00..186246.43 rows=11445843 width=8)"
"                                              ->  Hash  (cost=35248.91..35248.91 rows=1176214 width=4)"
"                                                    ->  Seq Scan on actor a  (cost=0.00..35248.91 rows=1176214 width=4)"
"                                                          Filter: ((gender)::text = 'M'::text)"

Luc M · Accepted Answer

I would try

SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (
    SELECT NULL
    FROM casts c, actor a
    WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year

ashja99 · Answer

I would first drop the distinct. I've run into many instances where a distinct in a subquery creates awful query plans. Also, I would maybe try using a left outer join with a filter for null instead of a not exists. Something like

SELECT m_fems.year, count(m_fems.id)
FROM movie as m_fems
    left outer join movie m_males
        inner join casts c on c.mid = m_males.mid
        inner join actor a on a.gender = 'M' and a.cid = c.cid
    on m_males.mid = m_fems.mid
WHERE m_males.mid is null
GROUP BY m_fems.year
ORDER BY m_fems.year

The potentially many-to-one left outer join won't interfere with your count because any matches are filtered out by m_males.mid is null.

Also, apologies if this isn't exactly proper PostgreSQL. I usually deal with SQL Server / T-SQL

Slow performance using NOT IN

Tags:

sql

postgresql

Jack

2 Answers

Luc M

ashja99

Recent Activity

Donate For Us

Slow performance using NOT IN

Tags:

sql

postgresql

Jack

2 Answers

Luc M

ashja99

Related questions

Recent Activity

Donate For Us