I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.
Query #1:
SELECT id FROM subsource_position WHERE id NOT IN (SELECT position_id FROM subsource)
This comes back with the following plan:
QUERY PLAN ------------------------------------------------------------------------------- Seq Scan on subsource_position (cost=0.00..362486535.10 rows=128524 width=4) Filter: (NOT (SubPlan 1)) SubPlan 1 -> Materialize (cost=0.00..2566.50 rows=101500 width=4) -> Seq Scan on subsource (cost=0.00..1662.00 rows=101500 width=4)
Query #2:
SELECT id FROM subsource_position EXCEPT SELECT position_id FROM subsource;
Plan:
QUERY PLAN ------------------------------------------------------------------------------------------------- SetOp Except (cost=24760.35..25668.66 rows=95997 width=4) -> Sort (cost=24760.35..25214.50 rows=181663 width=4) Sort Key: "*SELECT* 1".id -> Append (cost=0.00..6406.26 rows=181663 width=4) -> Subquery Scan on "*SELECT* 1" (cost=0.00..4146.94 rows=95997 width=4) -> Seq Scan on subsource_position (cost=0.00..3186.97 rows=95997 width=4) -> Subquery Scan on "*SELECT* 2" (cost=0.00..2259.32 rows=85666 width=4) -> Seq Scan on subsource (cost=0.00..1402.66 rows=85666 width=4) (8 rows)
I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN
to optimize well; is NOT IN
always a performance problem or is there a reason it does not optimize here?
Additional data:
=> select count(*) from subsource; count ------- 85158 (1 row) => select count(*) from subsource_position; count ------- 93261 (1 row)
Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.
Edit 2: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.
Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.
<> is the standard SQL operator meaning "not equal". Many databases, including postgresql, supports != as a synonym for <> . They're exactly the same in postgresql.
Yes the number of columns will - indirectly - influence the performance. The data in the columns will also affect the speed.
Some of the tricks we used to speed up SELECT-s in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary key type conversion, CLUSTER, pg_hint_plan + bonus.
Introduction to PostgreSQL EXCLUDE. PostgreSQL excludes statements in PostgreSQL is used to compare any two rows from the specified column or expression by using the operator specified in PostgreSQL. At the time of excluding the column, the comparison operator will return the null or false value as output.
Query #1 is not the elegant way for doing this... (NOT) IN SELECT is fine for a few entries, but it can't use indexes (Seq Scan
).
Not having EXCEPT, the alternative is to use a JOIN (HASH JOIN
):
SELECT sp.id FROM subsource_position AS sp LEFT JOIN subsource AS s ON (s.position_id = sp.id) WHERE s.position_id IS NULL
EXCEPT appeared in Postgres long time ago... But using MySQL I believe this is still the only way, using indexes, to achieve this.
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With