Optimization: WHERE x IN (1, 2 .., 100.000) vs INNER JOIN tmp_table USING(x)?

I recently had an interesting job interview. There I was asked a question about optimizing a query with a WHERE..IN clause containing a long list of scalars (thousands of values, that is). This question is NOT about subqueries in the IN clause, but about a plain list of scalars.

I answered right away that this can be optimized using an INNER JOIN with another table (possibly a temporary one) that contains only those scalars. My answer was accepted, and the reviewer noted that "no database engine currently can optimize long WHERE..IN conditions to be performant enough". I nodded.
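To make the two variants concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name items, its columns, and the short wanted list are assumptions made up for illustration (the real case would have thousands of values). It only shows that both forms return the same rows; it says nothing about which one is faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table; the name and columns are assumptions.
conn.execute("CREATE TABLE items (x INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

# Stand-in for the long list of scalars.
wanted = [3, 14, 159, 265, 358]

# Variant 1: WHERE x IN (...) with one placeholder per value.
placeholders = ", ".join("?" for _ in wanted)
in_rows = conn.execute(
    f"SELECT x FROM items WHERE x IN ({placeholders}) ORDER BY x",
    wanted,
).fetchall()

# Variant 2: load the values into a temporary table and join on it.
conn.execute("CREATE TEMP TABLE tmp_table (x INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO tmp_table VALUES (?)", [(v,) for v in wanted])
join_rows = conn.execute(
    "SELECT x FROM items INNER JOIN tmp_table USING (x) ORDER BY x"
).fetchall()

print(in_rows == join_rows)
```

Declaring tmp_table.x as a primary key gives the temporary table an index, which matters once the list gets long.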

But when I walked out, I started to have some doubts. The condition seemed too trivial and too widely used for a modern RDBMS not to be able to optimize it. So I started digging.

PostgreSQL:

It seems that PostgreSQL parses scalar IN() constructions into a ScalarArrayOpExpr structure, which is sorted. This structure is later used during the index scan to locate matching rows. EXPLAIN ANALYZE for such queries shows only one loop, and no joins are done. So I expect such a query to be even faster than an INNER JOIN. I tried some queries on my existing database and my tests supported that expectation. But I didn't take care to make the tests rigorous, and Postgres was running under Vagrant, so I might be wrong.

MSSQL Server:

MSSQL Server builds a hash structure from the list of constant expressions and then does a hash join with the source table. Even though no sorting seems to be done, I think the performance should be comparable. I didn't run any tests since I have no experience with this RDBMS.

MySQL Server:

Slide 13 of these slides says that before 5.0 this problem did occur in MySQL in some cases. Other than that, I didn't find any other problem related to bad IN () treatment. Unfortunately, I didn't find any proof of the opposite either. If you have, please let me know.

SQLite:

The documentation page hints at some problems, but I tend to believe the things described there are purely conceptual. I found no other information.

So I'm starting to think I misunderstood my interviewer or misused Google ;) Or maybe it's because we didn't set any conditions and our talk became a little vague (we didn't specify a concrete RDBMS or any other conditions; it was just an abstract talk).

It looks like the days when databases rewrote IN() as a set of OR statements (which, by the way, can sometimes cause problems with NULL values in lists) are long gone. Or not?
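One reading of the NULL remark is SQL's three-valued logic: when the list contains NULL and nothing matches, the result is NULL rather than false, and that is what makes NOT IN surprising. The sketch below checks this with Python's built-in sqlite3 module. SQLite is only a convenient stand-in here, and other engines may report the result differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(sql):
    """Run a single-value SELECT and return that value (None means SQL NULL)."""
    return conn.execute(sql).fetchone()[0]

# No match, and the list contains NULL: the answer is "unknown" (NULL).
in_result = scalar("SELECT 2 IN (1, NULL)")

# The hand-written OR expansion behaves the same way.
or_result = scalar("SELECT 2 = 1 OR 2 = NULL")

# NOT of "unknown" is still "unknown", so a WHERE clause filters the row out.
not_in_result = scalar("SELECT 2 NOT IN (1, NULL)")

# A real match wins regardless of the NULL in the list.
match_result = scalar("SELECT 1 IN (1, NULL)")

print(in_result, or_result, not_in_result, match_result)
```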

Of course, in cases where the list of scalars is longer than the maximum packet size the database protocol allows, an INNER JOIN might be the only solution available.

I think in some cases query parsing time alone (if the query was not prepared) can kill performance.

Also, databases might be unable to prepare an IN(?) query, which would lead to reparsing it again and again (which may kill performance). Actually, I never tried, but I think that even in such cases the cost of query parsing and planning is not large compared to query execution.
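A small sketch of why preparing such a query is awkward: with one placeholder per value, the statement text changes with the length of the list, so every new length is a new statement to parse. The helper in_query and the table names are hypothetical, and whether a given driver or server caches statements by text is an assumption, not something verified here.

```python
def in_query(n: int) -> str:
    """Build the IN query text for a list of n values (hypothetical helper)."""
    placeholders = ", ".join("?" for _ in range(n))
    return f"SELECT x FROM items WHERE x IN ({placeholders})"

# The temp-table variant has a single, fixed statement text.
JOIN_QUERY = "SELECT x FROM items INNER JOIN tmp_table USING (x)"

# Four different list lengths produce four different statement texts.
distinct_texts = {in_query(n) for n in (1, 2, 3, 1000)}

print(in_query(3))
print(len(distinct_texts))
```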

But other than that, I do not see other problems. Well, other than the problem of just HAVING this problem. If you have queries that contain thousands of IDs, something is wrong with your architecture.

Do you?

441

asked Dec 02 '15 15:12

Vladislav Rastrusny

1 Answer

Your answer is only correct if you build an index (preferably a primary key index) on the list, unless the list is really small.

Any description of optimization is definitely database specific. However, the MySQL documentation is quite explicit about how it optimizes IN:

Returns 1 if expr is equal to any of the values in the IN list, else returns 0. If all values are constants, they are evaluated according to the type of expr and sorted. The search for the item then is done using a binary search. This means IN is very quick if the IN value list consists entirely of constants.
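The idea in that quote (sort the constants once, then binary-search each probed value) can be sketched in a few lines of Python. This is only an illustration of the technique, not MySQL's actual implementation, and the function names are made up.

```python
from bisect import bisect_left

def make_in_list(constants):
    """Sort (and deduplicate) the constants once, when the query is set up."""
    return sorted(set(constants))

def in_list_contains(sorted_constants, value):
    """Binary search: O(log n) comparisons per probed row value."""
    i = bisect_left(sorted_constants, value)
    return i < len(sorted_constants) and sorted_constants[i] == value

in_list = make_in_list([42, 7, 100000, 7, 13])

print(in_list_contains(in_list, 13))
print(in_list_contains(in_list, 14))
```

For a list of 100,000 constants this is about 17 comparisons per row, compared with up to 100,000 for a naive scan of the list.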

This would definitely be a case where using IN would be faster than using another table -- and probably faster than another table using a primary key index.

I think that SQL Server replaces the IN with a list of ORs. These would then be implemented as sequential comparisons. Note that sequential comparisons can be faster than a binary search, if some elements are much more common than others and those appear first in the list.
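The last point can be checked by counting comparisons. The sketch below is a toy model with made-up data, not a measurement of any database: it compares a sequential scan of an OR-style list against a binary search over the same values when the looked-up value happens to sit first in the list.

```python
def sequential_comparisons(values, target):
    """Comparisons made by checking the values one by one, as an OR chain would."""
    for count, v in enumerate(values, start=1):
        if v == target:
            return count
    return len(values)

def binary_search_comparisons(sorted_values, target):
    """Comparisons made by a lower-bound binary search over sorted values."""
    lo, hi, count = 0, len(sorted_values), 0
    while lo < hi:
        mid = (lo + hi) // 2
        count += 1
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    return count

values = list(range(1000))  # already sorted; 0 plays the very common value

hot_sequential = sequential_comparisons(values, 0)
hot_binary = binary_search_comparisons(values, 0)
cold_sequential = sequential_comparisons(values, 999)
cold_binary = binary_search_comparisons(values, 999)

print(hot_sequential, hot_binary, cold_sequential, cold_binary)
```

If most rows carry the value at the front of the list, the sequential scan wins on average; for values near the end it loses badly.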

124

answered Oct 24 '22 01:10

Gordon Linoff


Tags: sql-server, sqlite, mysql, postgresql, query-optimization
