
Does a Postgres anti-join need a table scan?

I need an anti-join (NOT EXISTS (SELECT something FROM table ...), or LEFT JOIN table ... WHERE table.id IS NULL) on the same table. Actually, I have an index that could serve the NOT EXISTS check, but the query planner chooses a bitmap heap scan.
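
As a rough sketch of the two forms mentioned, assuming a hypothetical table named big with columns id and ref_id (these names are invented for illustration and are not the real schema):

-- Hypothetical self anti-join: rows of big that no other row references.
-- Form 1: NOT EXISTS
SELECT t1.id
FROM big t1
WHERE NOT EXISTS (SELECT 1 FROM big t2 WHERE t2.ref_id = t1.id);

-- Form 2: LEFT JOIN ... IS NULL, which should return the same rows
SELECT t1.id
FROM big t1
LEFT JOIN big t2 ON t2.ref_id = t1.id
WHERE t2.ref_id IS NULL;

Both spellings ask for the rows of t1 that have no match in t2, so the planner is free to treat either one as an anti-join.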

The table has 100 million rows, so a heap scan is far too expensive.

It would be really fast if Postgres could just compare the indexes. Does Postgres have to visit the table for this anti-join?

I know the table has to be visited at some point to satisfy MVCC, but why so early? Can NOT EXISTS only be resolved against the table, because the index alone could miss something?

Franz Kafka asked Jan 20 '23 06:01



2 Answers

You'll need to provide version details and, as jmz says, EXPLAIN ANALYSE output to get any useful advice.

Franz, don't speculate about whether it's possible: test it and find out.

This is v9.0:

CREATE TABLE tl (i int, t text);
CREATE TABLE tr (i int, t text);
INSERT INTO tl SELECT s, 'text ' || s FROM generate_series(1,999999) s;
INSERT INTO tr SELECT s, 'text ' || s FROM generate_series(1,999999) s WHERE s % 3 = 0;
ALTER TABLE tl add primary key (i);
CREATE INDEX tr_i_idx ON tr (i);
ANALYSE;
-- t exists in both tables, so it must be qualified; i is merged by USING
EXPLAIN ANALYSE SELECT i, tl.t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;
                                                         QUERY PLAN                                                      
-----------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.95..45611.86 rows=666666 width=15) (actual time=0.040..4011.970 rows=666666 loops=1)
   Merge Cond: (tl.i = tr.i)
   ->  Index Scan using tl_pkey on tl  (cost=0.00..29201.32 rows=999999 width=15) (actual time=0.017..1356.996 rows=999999 lo
   ->  Index Scan using tr_i_idx on tr  (cost=0.00..9745.27 rows=333333 width=4) (actual time=0.015..439.087 rows=333333 loop
 Total runtime: 4602.224 ms

What you see will depend on your version, and the stats the planner sees.
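
To act on that, a minimal sketch using the tl and tr test tables from this answer: report the server version, refresh the planner statistics, and try the NOT EXISTS spelling of the same anti-join. The parenthesised EXPLAIN options, including BUFFERS, assume 9.0 or later.

-- Which version is the planner running on?
SELECT version();

-- Refresh the statistics the planner bases its estimates on
ANALYZE tl;
ANALYZE tr;

-- Same anti-join written with NOT EXISTS; BUFFERS adds page-access counts
EXPLAIN (ANALYZE, BUFFERS)
SELECT i, t
FROM tl
WHERE NOT EXISTS (SELECT 1 FROM tr WHERE tr.i = tl.i);

Comparing this plan with the LEFT JOIN plan on your own version shows whether both forms are planned as an anti-join.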

Richard Huxton answered Jan 31 '23 09:01



My (simplified) query:

SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id;

A query plan like this one works well:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=0.57..3831.88 rows=128092 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.42..3399.70 rows=130352 width=8)
   ->  Index Only Scan using b_pkey on b  (cost=0.15..78.06 rows=2260 width=8)
(4 rows)
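
The Index Only Scan nodes in a plan like this can skip the heap only for pages that the visibility map marks as all-visible. A rough way to check this, assuming PostgreSQL 9.2 or later and the simplified table names a and b from the query above:

-- VACUUM updates the visibility map that index-only scans rely on
VACUUM ANALYZE a;
VACUUM ANALYZE b;

-- relallvisible close to relpages means few heap visits are needed
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname IN ('a', 'b');

-- With ANALYZE, each Index Only Scan node reports a "Heap Fetches" count.
-- Note that this actually executes the query.
EXPLAIN (ANALYZE)
SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY a.id;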

Sometimes, though, PostgreSQL 9.5.9 would switch to a sequential scan when the planner estimated it to be cheaper (see "Why does PostgreSQL perform sequential scan on indexed column?"). In my case that made things worse:

                                                       QUERY PLAN                                                        
-------------------------------------------------------------------------------------------------------------------------
 Merge Anti Join  (cost=405448.22..39405858.08 rows=1365191502 width=8)
   Merge Cond: (a.id = b.id)
   ->  Index Only Scan using a_pkey on a  (cost=0.58..35528317.86 rows=1368180352 width=8)
   ->  Materialize  (cost=405447.64..420391.89 rows=2988850 width=8)
         ->  Sort  (cost=405447.64..412919.76 rows=2988850 width=8)
               Sort Key: b.id
               ->  Seq Scan on b  (cost=0.00..43113.50 rows=2988850 width=8)
(7 rows)

My (hack) solution was to discourage sequential scans by:

set enable_seqscan to off;
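
A plain SET changes the setting for the rest of the session. A minimal sketch of limiting it to one transaction with SET LOCAL, reusing the simplified query from above:

BEGIN;
-- Applies only until COMMIT or ROLLBACK
SET LOCAL enable_seqscan = off;
SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY a.id;
COMMIT;

Turning enable_seqscan off does not forbid sequential scans. It only makes the planner cost them very highly, so a sequential scan is still chosen when no other plan is possible.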

The PostgreSQL documentation suggests that a better way to influence this is to adjust the planner cost constants such as seq_page_cost, which can be overridden per tablespace using ALTER TABLESPACE. This might be advisable when using ORDER BY on indexed columns, but I'm not sure. https://www.postgresql.org/docs/9.1/static/runtime-config-query.html
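
As a sketch of what that could look like: the tablespace name fast_disks is hypothetical, and the cost values are illustrative rather than recommendations.

-- Try a different cost in the current session first
SET random_page_cost = 1.5;

-- Persist per-tablespace overrides (hypothetical tablespace name)
ALTER TABLESPACE fast_disks SET (seq_page_cost = 1.0, random_page_cost = 1.5);

-- Return to the server-wide defaults
ALTER TABLESPACE fast_disks RESET (seq_page_cost, random_page_cost);

Lowering random_page_cost relative to seq_page_cost makes index scans look cheaper to the planner, so it is worth re-running EXPLAIN after each change to see whether the plan actually switches.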

TimSC answered Jan 31 '23 09:01
