SQL: When it comes to NOT IN and NOT EQUAL TO, which is more efficient and why?

Let's say I have a set of items:

  • Item1
  • Item2
  • Item3
  • Item4
  • Item5

A query can be constructed in two ways. Firstly:

SELECT *
FROM TABLE
WHERE ITEM NOT IN ('item1', 'item2', 'item3', 'item4', 'item5')

Or, it can be written as:

SELECT *
FROM TABLE
WHERE ITEM != 'item1'
  AND ITEM != 'item2'
  AND ITEM != 'item3'
  AND ITEM != 'item4'
  AND ITEM != 'item5'
  1. Which is more efficient and why?
  2. At what point does one become more efficient than the other? In other words, what if there were 500 items?

My question is specifically relating to PostgreSQL.

Asked Jun 11 '13 by coderama


1 Answer

In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.

In both cases, if the lists are long enough that you're even asking the question, you should use an anti-join or a subquery exclusion test over a value list instead.

WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS (
    SELECT 1 FROM excluded e WHERE t.item = e.item
);

or:

WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;

(On modern Pg versions both will produce the same query plan anyway).

If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At this point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, then using one of the above approaches on the temp table instead of the CTE.
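A minimal sketch of that temp-table approach (the table name, file path, and column here are illustrative, not from the original answer; the path is a placeholder you would replace with your own):

```sql
-- Load the exclusion list into a temp table instead of a huge literal list.
CREATE TEMPORARY TABLE excluded(item text);

-- Bulk-load the values to exclude. (From psql, the client-side \copy
-- variant reads a file on the client machine instead of the server.)
COPY excluded(item) FROM '/path/to/excluded_items.csv';

-- Optionally index and analyze, which may help for some plan shapes.
CREATE INDEX excluded_item_idx ON excluded(item);
ANALYZE excluded;

-- Then run the same subquery exclusion against the temp table:
SELECT *
FROM thetable t
WHERE NOT EXISTS (
    SELECT 1 FROM excluded e WHERE t.item = e.item
);
```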

Demo:

CREATE UNLOGGED TABLE exclude_test(id integer PRIMARY KEY);
INSERT INTO exclude_test(id) SELECT generate_series(1, 50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1, 40000, 4) x;

where exclude is the list of values to omit.
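For reference, the table-based variants being timed look roughly like this (a sketch reconstructed from the descriptions below; the exact benchmark queries are in the linked gist):

```sql
-- Subquery (anti-semi-join) exclusion against the exclude table:
SELECT *
FROM exclude_test t
WHERE NOT EXISTS (
    SELECT 1 FROM exclude e WHERE t.id = e.item
);

-- LEFT JOIN exclusion against the same table:
SELECT *
FROM exclude_test t
LEFT OUTER JOIN exclude e ON (t.id = e.item)
WHERE e.item IS NULL;
```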

I then compare the following approaches on the same data with all results in milliseconds:

  • NOT IN list: 3424.596
  • AND ... list: 80173.823
  • VALUES based JOIN exclusion: 20.727
  • VALUES based subquery exclusion: 20.495
  • Table-based JOIN, no index on ex-list: 25.183
  • Subquery table based, no index on ex-list: 23.985

... making the CTE-based approach roughly 3,900 times faster than the AND list and over 160 times faster than the NOT IN list.

Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).

For this data set size adding an index on the exclusion list made no difference.

Notes:

  • IN list generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' FROM exclude;
  • AND list generated with SELECT string_agg(item::text, ' AND item <> ') FROM exclude;
  • Subquery- and join-based table exclusion performed much the same across repeated runs.
  • Examination of the plan shows that Pg translates NOT IN to <> ALL.
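You can see that translation yourself with EXPLAIN (a small illustrative example; exact plan output varies by table and PostgreSQL version, so the comment below is only the typical shape of the filter line):

```sql
EXPLAIN SELECT *
FROM exclude_test
WHERE id NOT IN (1, 2, 3);
-- The plan's filter line typically reads something like:
--   Filter: (id <> ALL ('{1,2,3}'::integer[]))
```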

So... you can see that there's a truly huge gap between the NOT IN and AND lists and a proper join. What surprised me was how fast the CTE over a VALUES list was: parsing the VALUES list took almost no time at all, and it performed the same as or slightly faster than the table approach in most tests.

It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.

See also:

  • this handy blog post Magnus Hagander wrote on the topic
Answered Oct 21 '22 by Craig Ringer