I am trying to make the query to get each year with a count of the number of movies in that year that had casts that all were not male (for each year, count the number of movies in that year that had no males).
These are the tables:
ACTOR (id, fname, lname, gender)
MOVIE (id, name, year)
CASTS (pid, mid, role) -- pid refers to actor id, mid refers to movie id
This is what I indexed (id
for the tables are primary keys, so they're already indexed, or so I assume):
CREATE INDEX gender_index on actor(gender);
CREATE INDEX movie_name_index on movie(name);
CREATE INDEX movie_year_index on movie(year);
CREATE INDEX casts_index on casts(pid, mid, role);
CREATE INDEX casts_pid_index on casts(pid);
CREATE INDEX casts_mid_index on casts(mid);
CREATE INDEX casts_role_index on casts(role);
This is my query:
SELECT m.year, count(m.id)
FROM movie as m
WHERE m.id NOT IN (
SELECT DISTINCT m.id
FROM movie as m, casts as c, actor as a
WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year
The query takes forever (and never completes), so how can I make this fast? Does using NOT EXISTS
help, although I thought the optimizer takes care of this? Do I need to index anything else? Is there another query that is better? I'm using PostgreSQL if this makes any difference.
Here is the EXPLAIN
:
"GroupAggregate (cost=1512539.61..171886457832.52 rows=61 width=8)"
" Group Key: m.year"
" -> Index Scan using movie_year_index on movie m (cost=1512539.61..171886453988.38 rows=768706 width=8)"
" Filter: (NOT (SubPlan 1))"
" SubPlan 1"
" -> Materialize (cost=1512539.18..1732298.66 rows=1537411 width=4)"
" -> Unique (cost=1512539.18..1718605.60 rows=1537411 width=4)"
" -> Merge Join (cost=1512539.18..1700559.32 rows=7218511 width=4)"
" Merge Cond: (m_1.id = c.mid)"
" -> Index Only Scan using movie_pkey on movie m_1 (cost=0.43..57863.94 rows=1537411 width=4)"
" -> Materialize (cost=1512531.37..1548623.92 rows=7218511 width=4)"
" -> Sort (cost=1512531.37..1530577.65 rows=7218511 width=4)"
" Sort Key: c.mid"
" -> Hash Join (cost=54546.59..492838.95 rows=7218511 width=4)"
" Hash Cond: (c.pid = a.id)"
" -> Seq Scan on casts c (cost=0.00..186246.43 rows=11445843 width=8)"
" -> Hash (cost=35248.91..35248.91 rows=1176214 width=4)"
" -> Seq Scan on actor a (cost=0.00..35248.91 rows=1176214 width=4)"
" Filter: ((gender)::text = 'M'::text)"
I would try
SELECT m.year, count(m.id)
FROM movie m
WHERE NOT EXISTS (
SELECT NULL
FROM casts c, actor a
WHERE m.id = c.mid and a.id = c.pid and a.gender = 'M'
)
GROUP BY m.year
ORDER BY m.year
I would first drop the distinct
. I've run into many instances where a distinct in a subquery creates awful query plans. Also, I would maybe try using a left outer join
with a filter for null
instead of a not exists
. Something like
SELECT m_fems.year, count(m_fems.id)
FROM movie as m_fems
left outer join movie m_males
inner join casts c on c.mid = m_males.mid
inner join actor a on a.gender = 'M' and a.cid = c.cid
on m_males.mid = m_fems.mid
WHERE m_males.mid is null
GROUP BY m_fems.year
ORDER BY m_fems.year
The potentially many-to-one left outer join won't interfere with your count because any matches are filtered out by m_males.mid is null
.
Also, apologies if this isn't exactly proper PostgreSQL. I usually deal with SQL Server / T-SQL
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With