Let's say I have a set of items. A query that excludes them can be constructed in two ways. Firstly:
SELECT * FROM TABLE WHERE ITEM NOT IN ('item1', 'item2', 'item3', 'item4','item5')
Or, it can be written as:
SELECT * FROM TABLE WHERE ITEM != 'item1' AND ITEM != 'item2' AND ITEM != 'item3' AND ITEM != 'item4' AND ITEM != 'item5'
My question is specifically relating to PostgreSQL.
If != and <> are the same, which one should be used in SQL queries? You can use either != or <> in your queries, as they are technically identical, but I prefer <> because it is the SQL-92 standard spelling. Both operators test two expressions for inequality and give the same output; the only difference is that <> is in line with the ISO standard while != is not.
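As a quick illustration (the table and value here are hypothetical), both predicates return exactly the same rows:

```sql
-- <> is the SQL-92 spelling; != is an accepted alias in PostgreSQL.
SELECT * FROM thetable WHERE item <> 'item1';
SELECT * FROM thetable WHERE item != 'item1';
```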
If you can write your query either way, NOT IN is preferred as far as I'm concerned: it is more concise, and the first form will generally be no slower, since there are fewer comparisons to spell out.
In PostgreSQL there's usually a fairly small difference at reasonable list lengths, though NOT IN is much cleaner conceptually. Very long AND ... <> ... lists and very long NOT IN lists both perform terribly, with AND much worse than NOT IN.
In both cases, if they're long enough for you to even be asking the question you should be doing an anti-join or subquery exclusion test over a value list instead.
WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT *
FROM thetable t
WHERE NOT EXISTS (SELECT 1 FROM excluded e WHERE t.item = e.item);
or:
WITH excluded(item) AS (
    VALUES ('item1'), ('item2'), ('item3'), ('item4'), ('item5')
)
SELECT *
FROM thetable t
LEFT OUTER JOIN excluded e ON (t.item = e.item)
WHERE e.item IS NULL;
(On modern Pg versions both will produce the same query plan anyway).
If the value list is long enough (many tens of thousands of items) then query parsing may start having a significant cost. At that point you should consider creating a TEMPORARY table, COPYing the data to exclude into it, possibly creating an index on it, and then using one of the above approaches on the temp table instead of the CTE.
Demo:

CREATE UNLOGGED TABLE exclude_test(id integer PRIMARY KEY);
INSERT INTO exclude_test(id) SELECT generate_series(1,50000);
CREATE TABLE exclude AS SELECT x AS item FROM generate_series(1,40000,4) x;
where exclude is the list of values to omit.
I then compare the following approaches on the same data, with all results in milliseconds:

NOT IN list: 3424.596
AND ... list: 80173.823
VALUES-based JOIN exclusion: 20.727
VALUES-based subquery exclusion: 20.495
JOIN, no index on ex-list: 25.183

... making the CTE-based approach over three thousand times faster than the AND list and 130 times faster than the NOT IN list.
Code here: https://gist.github.com/ringerc/5755247 (shield your eyes, ye who follow this link).
For this data set size adding an index on the exclusion list made no difference.
Notes:

The IN list was generated with SELECT 'IN (' || string_agg(item::text, ',' ORDER BY item) || ')' FROM exclude;
The AND list was generated with SELECT string_agg(item::text, ' AND item <> ') FROM exclude;
PostgreSQL internally rewrites the NOT IN value list to a <> ALL test over an array.
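PostgreSQL treats a NOT IN value list as a <> ALL comparison against an array, so the two queries below express the same exclusion test (the values shown are illustrative):

```sql
-- The NOT IN value-list form ...
SELECT * FROM thetable t
WHERE t.item NOT IN ('item1', 'item2', 'item3');

-- ... which the parser turns into the equivalent array form
SELECT * FROM thetable t
WHERE t.item <> ALL (ARRAY['item1', 'item2', 'item3']);
```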
So... you can see that there's a truly huge gap between both the IN and AND lists versus doing a proper join. What surprised me was how fast doing it with a CTE using a VALUES list was... parsing the VALUES list took almost no time at all, performing the same as or slightly faster than the table approach in most tests.
It'd be nice if PostgreSQL could automatically recognise a preposterously long IN clause or chain of similar AND conditions and switch to a smarter approach, like doing a hashed join or implicitly turning it into a CTE node. Right now it doesn't know how to do that.