
PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

Tags: sql, postgresql

I have two queries that are functionally identical. One of them performs very well, the other performs very poorly. I do not see where the performance difference comes from.

Query #1:

    SELECT id
    FROM subsource_position
    WHERE id NOT IN (SELECT position_id FROM subsource)

This comes back with the following plan:

                                      QUERY PLAN
    -------------------------------------------------------------------------------
     Seq Scan on subsource_position  (cost=0.00..362486535.10 rows=128524 width=4)
       Filter: (NOT (SubPlan 1))
       SubPlan 1
         ->  Materialize  (cost=0.00..2566.50 rows=101500 width=4)
               ->  Seq Scan on subsource  (cost=0.00..1662.00 rows=101500 width=4)

Query #2:

    SELECT id FROM subsource_position
    EXCEPT
    SELECT position_id FROM subsource;

Plan:

                                               QUERY PLAN
    -------------------------------------------------------------------------------------------------
     SetOp Except  (cost=24760.35..25668.66 rows=95997 width=4)
       ->  Sort  (cost=24760.35..25214.50 rows=181663 width=4)
             Sort Key: "*SELECT* 1".id
             ->  Append  (cost=0.00..6406.26 rows=181663 width=4)
                   ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..4146.94 rows=95997 width=4)
                         ->  Seq Scan on subsource_position  (cost=0.00..3186.97 rows=95997 width=4)
                   ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..2259.32 rows=85666 width=4)
                         ->  Seq Scan on subsource  (cost=0.00..1402.66 rows=85666 width=4)
    (8 rows)

I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?

Additional data:

    => select count(*) from subsource;
     count
    -------
     85158
    (1 row)

    => select count(*) from subsource_position;
     count
    -------
     93261
    (1 row)

Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.

Edit 2: I'm using PostgreSQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.

Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.


asked Aug 19 '11 17:08

Daniel Lyons

2 Answers

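Query #1 is not an elegant way to do this. (NOT) IN with a subquery is fine for a few entries, but here it cannot use the indexes: the plan falls back to a Seq Scan and checks the materialized subquery result again for every outer row, which is where the enormous cost estimate comes from.

Without EXCEPT, the alternative is a LEFT JOIN that keeps only the unmatched rows, which the planner can run as a hash join: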

    SELECT sp.id
    FROM subsource_position AS sp
        LEFT JOIN subsource AS s ON (s.position_id = sp.id)
    WHERE
        s.position_id IS NULL

EXCEPT has been in Postgres for a long time. In MySQL, though, I believe the LEFT JOIN form is still the only way to do this while making use of indexes.
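
A third spelling that may be worth trying is NOT EXISTS, which PostgreSQL (8.4 and later) can plan as an anti-join, much like the LEFT JOIN form. This is only a sketch using the table and column names from the question. It assumes, as the question states, that both columns are NOT NULL, because NOT IN and NOT EXISTS treat NULLs differently:

```sql
-- Rows of subsource_position that have no matching subsource.position_id.
-- Assumes both columns are NOT NULL, as stated in the question.
SELECT sp.id
FROM subsource_position AS sp
WHERE NOT EXISTS (
    SELECT 1
    FROM subsource AS s
    WHERE s.position_id = sp.id
);
```

Unlike EXCEPT, this form does not remove duplicate id values, which only matters if id is not unique.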


answered Sep 24 '22 06:09

Antony Gibbs

Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow for 1MB of work memory. Try 10MB or 20MB.
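
One way to try this without editing postgresql.conf is to raise work_mem for the current session only and re-run EXPLAIN on query #1. A sketch, where the 20MB figure is simply the upper suggestion here and not a measured value:

```sql
-- Raise work_mem for this session only; the value is illustrative.
SET work_mem = '20MB';

-- Plain EXPLAIN only plans the query, so it returns without running it.
EXPLAIN
SELECT id
FROM subsource_position
WHERE id NOT IN (SELECT position_id FROM subsource);

-- Go back to the configured value afterwards.
RESET work_mem;
```

If the planner now estimates that the subquery result fits in memory, the filter line should read NOT (hashed SubPlan 1) rather than showing a Materialize node under the subplan.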

answered Sep 23 '22 06:09

Magnus Hagander
