SQL indexes for "not equal" searches

Tags:

The SQL index allows to find quickly a string which matches my query. Now, I have to search in a big table the strings which do not match. Of course, the normal index does not help and I have to do a slow sequential scan:

essais=> \d phone_idx
Index "public.phone_idx"
 Column | Type 
--------+------
 phone  | text
btree, for table "public.phonespersons"

essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone = '+33 1234567';
                                  QUERY PLAN                                   
-------------------------------------------------------------------------------
 Index Scan using phone_idx on phonespersons  (cost=0.00..8.41 rows=1 width=4)
   Index Cond: (phone = '+33 1234567'::text)
(2 rows)

essais=> EXPLAIN SELECT person FROM PhonesPersons WHERE phone != '+33 1234567';
                              QUERY PLAN                              
----------------------------------------------------------------------
 Seq Scan on phonespersons  (cost=0.00..18621.00 rows=999999 width=4)
   Filter: (phone <> '+33 1234567'::text)
(2 rows)

I understand (see Mark Byers' very good explanations) that PostgreSQL can decide not to use an index when it sees that a sequential scan would be faster (for instance if almost all the tuples match). But, here, "not equal" searches are really slower.

Any way to make these "is not equal to" searches faster?

Here is another example, to address Mark Byers' excellent remarks. The index is used for the '=' query (which returns the vast majority of tuples) but not for the '!=' query:

essais=> \d tld_idx
 Index "public.tld_idx"
     Column      | Type 
-----------------+------
 pg_expression_1 | text
btree, for table "public.emailspersons"

essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) = 'fr';
                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using tld_idx on emailspersons  (cost=0.25..4010.79 rows=97033 width=4) (actual time=0.137..261.123 rows=97110 loops=1)
   Index Cond: (tld(email) = 'fr'::text)
 Total runtime: 444.800 ms
(3 rows)

essais=> EXPLAIN ANALYZE SELECT person FROM EmailsPersons WHERE tld(email) != 'fr';
                         QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Seq Scan on emailspersons  (cost=0.00..27129.00 rows=2967 width=4) (actual time=1.004..1031.224 rows=2890 loops=1)
   Filter: (tld(email) <> 'fr'::text)
 Total runtime: 1037.278 ms
(3 rows)

DBMS is PostgreSQL 8.3 (but I can upgrade to 8.4).

315

asked May 19 '10 09:05

bortzmeyer

1 Answers

Possibly it would help to write:

SELECT person FROM PhonesPersons WHERE phone < '+33 1234567'
UNION ALL
SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'

or simply

SELECT person FROM PhonesPersons WHERE phone > '+33 1234567'
                                       OR phone < '+33 1234567'

PostgreSQL should be able to determine that the selectivity of the range operation is very high and to consider using an index for it.

I don't think it can use an index directly to satisfy a not-equals predicate, although it would be nice if it could try re-writing the not-equals as above (if it helps) during planning. If it works, suggest it to the developers ;)

Rationale: searching an index for all values not equal to a certain one requires scanning the full index. By contrast, searching for all elements less than a certain key means finding the greatest non-matching item in the tree and scanning backwards. Similarly, searching for all elements greater than a certain key in the opposite direction. These operations are easy to fulfill using b-tree structures. Also, the statistics that PostgreSQL collects should be able to point out that "+33 1234567" is a known frequent value: by removing the frequency of those and nulls from 1, we have the proportion of rows left to select: the histogram bounds will indicate whether those are skewed to one side or not. But if the exclusion of nulls and that frequent value pushes the proportion of rows remaining low enough (Istr about 20%), an index scan should be appropriate. Check the stats for the column in pg_stats to see what proportion it's actually calculated.

Update: I tried this on a local table with a vaguely similar distribution, and both forms of the above produced something other than a plain seq scan. The latter (using "OR") was a bitmap scan that may actually devolve to just being a seq scan if the bias towards your common value is particularly extreme... although the planner can see that, I don't think it will automatically rewrite to an "Append(Index Scan,Index Scan)" internally. Turning "enable_bitmapscan" off just made it revert to a seq scan.

PS: indexing a text column and using the inequality operators can be an issue, if your database location is not C. You may need to add an extra index that uses text_pattern_ops or varchar_pattern_ops; this is similar to the problem of indexing for column LIKE 'prefix%' predicates.

Alternative: you could create a partial index:

CREATE INDEX PhonesPersonsOthers ON PhonesPersons(phone) WHERE phone <> '+33 1234567'

this will make the <>-using select statement just scan through that partial index: since it excludes most of the entries in the table, it should be small.

answered Oct 23 '22 20:10

araqnid

Related questions
                            
                                Select a portion from a MySQL Blob Field
                            
                                how much safe from SQL-Injection if using hibernate
                            
                                Query for count of distinct values in a rolling date range
                            
                                Slow query with where clause
                            
                                Get the difference between two dates both In Months and days in sql
                            
                                SQL TOP 5000 faster than normal query with less than 5000 result rows?
                            
                                Does Postgresql plpgsql/sql support short circuiting in the where clause?
                            
                                Postgres: Optimizing querying by datetime
                            
                                ERROR 1054 (42S22): Unknown column '‍‍' in 'field list'
                            
                                SQL Error: ORA-00913: too many values
                            
                                Counting relationships in SQLAlchemy
                            
                                Calling DB Function with Entity Framework 6
                            
                                SQL Server: Replace invalid XML characters from a VARCHAR(MAX) field
                            
                                Using Reactive MySQL Databases in Meteor (An Update?)
                            
                                ORDER BY alphanumeric characters only in SQLite
                            
                                Oracle SQL - can I return the "before" state of a column value
                            
                                Delete all tables in Derby DB
                            
                                How do I make DBIx::Class join tables using other operators than `=`?
                            
                                Using a table to provide enum values in MySQL?
                            
                                Greatest not null column

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

SQL indexes for "not equal" searches

Tags:

sql

indexing

postgresql

bortzmeyer

People also ask

1 Answers

araqnid

Recent Activity

Donate For Us