Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

Tags:

sql

postgresql

I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.

Query #1:

SELECT id  FROM subsource_position WHERE   id NOT IN (SELECT position_id FROM subsource) 

This comes back with the following plan:

                                  QUERY PLAN                                    -------------------------------------------------------------------------------  Seq Scan on subsource_position  (cost=0.00..362486535.10 rows=128524 width=4)    Filter: (NOT (SubPlan 1))    SubPlan 1      ->  Materialize  (cost=0.00..2566.50 rows=101500 width=4)            ->  Seq Scan on subsource  (cost=0.00..1662.00 rows=101500 width=4) 

Query #2:

SELECT id FROM subsource_position EXCEPT SELECT position_id FROM subsource; 

Plan:

                                           QUERY PLAN                                             -------------------------------------------------------------------------------------------------  SetOp Except  (cost=24760.35..25668.66 rows=95997 width=4)    ->  Sort  (cost=24760.35..25214.50 rows=181663 width=4)          Sort Key: "*SELECT* 1".id          ->  Append  (cost=0.00..6406.26 rows=181663 width=4)                ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..4146.94 rows=95997 width=4)                      ->  Seq Scan on subsource_position  (cost=0.00..3186.97 rows=95997 width=4)                ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..2259.32 rows=85666 width=4)                      ->  Seq Scan on subsource  (cost=0.00..1402.66 rows=85666 width=4) (8 rows) 

I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?

Additional data:

=> select count(*) from subsource;  count  -------  85158 (1 row)  => select count(*) from subsource_position;  count  -------  93261 (1 row) 

Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.

Edit 2: I'm using PostgresQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.

Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.

like image 867
Daniel Lyons Avatar asked Aug 19 '11 17:08

Daniel Lyons


People also ask

What does <> mean in PostgreSQL?

<> is the standard SQL operator meaning "not equal". Many databases, including postgresql, supports != as a synonym for <> . They're exactly the same in postgresql.

Does number of columns affect performance in Postgres?

Yes the number of columns will - indirectly - influence the performance. The data in the columns will also affect the speed.

How improve SQL query PostgreSQL performance?

Some of the tricks we used to speed up SELECT-s in PostgreSQL: LEFT JOIN with redundant conditions, VALUES, extended statistics, primary key type conversion, CLUSTER, pg_hint_plan + bonus.

What does excluded do in Postgres?

Introduction to PostgreSQL EXCLUDE. PostgreSQL excludes statements in PostgreSQL is used to compare any two rows from the specified column or expression by using the operator specified in PostgreSQL. At the time of excluding the column, the comparison operator will return the null or false value as output.


2 Answers

Query #1 is not the elegant way for doing this... (NOT) IN SELECT is fine for a few entries, but it can't use indexes (Seq Scan).

Not having EXCEPT, the alternative is to use a JOIN (HASH JOIN):

    SELECT sp.id     FROM subsource_position AS sp         LEFT JOIN subsource AS s ON (s.position_id = sp.id)     WHERE         s.position_id IS NULL 

EXCEPT appeared in Postgres long time ago... But using MySQL I believe this is still the only way, using indexes, to achieve this.

like image 155
Antony Gibbs Avatar answered Sep 24 '22 06:09

Antony Gibbs


Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1Mb of work memory. Try 10 or 20mb.

like image 22
Magnus Hagander Avatar answered Sep 23 '22 06:09

Magnus Hagander