Performance of MySQL "IN"

I'm running a MySQL query in two steps. First, I get a list of ids with one query, and then I retrieve the data for those ids using a second query along the lines of SELECT * FROM data WHERE id IN (id1, id2 ...). I know it sounds hacky, but I've done it this way because the queries are very complicated: the first involves lots of geometry and trigonometry, the second lots of different joins. I'm sure they could be written as a single query, but my MySQL isn't good enough to pull it off.

This approach works, but it doesn't feel right, and I'm concerned it won't scale. At the moment I am testing on a database of 10,000 records, with 400 ids in the "IN" clause (i.e. IN (id1, id2 ... id400)), and performance is fine. But what if there are, say, 1,000,000 records?

Where are the performance bottlenecks (speed, memory, etc.) for this kind of query? Any ideas for how to refactor this kind of query would be awesome too (for example, whether it is worth swotting up on stored procedures).

610

asked Oct 08 '09 13:10

Roy

2 Answers

Starting from a certain number of records, the IN predicate over a SELECT becomes faster than that over a list of constants.

See this article in my blog for performance comparison:

Passing parameters in MySQL: IN list vs. temporary table
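As a rough sketch of the temporary-table variant (not taken from the article itself): the ids from the first query go into an indexed temporary table, and the second query uses IN over a SELECT instead of a long list of constants. The table name `tmp_ids` and the sample id values are made up for illustration; `data` and `id` are the names from the question.

```sql
-- Hypothetical temporary table holding the ids returned by the first query.
-- The primary key gives MySQL an index to probe for each candidate row.
CREATE TEMPORARY TABLE tmp_ids (
    id INT NOT NULL PRIMARY KEY
);

-- Sample values only; in practice the application inserts the real ids.
INSERT INTO tmp_ids (id) VALUES (1), (2), (3);

-- IN over a SELECT instead of IN (id1, id2 ... id400).
SELECT  *
FROM    data
WHERE   id IN
        (
        SELECT  id
        FROM    tmp_ids
        );

-- Temporary tables vanish when the connection closes; dropping early is optional.
DROP TEMPORARY TABLE IF EXISTS tmp_ids;
```

Whether this beats a plain constant list depends on the list size and the MySQL version, so it is worth comparing both forms with EXPLAIN on realistic data.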

If the column used in the query in the IN clause is indexed, like this:

    SELECT  *
    FROM    table1
    WHERE   unindexed_column IN
            (
            SELECT  indexed_column
            FROM    table2
            )

then this query is simply optimized to an EXISTS (which uses only one entry for each record from table1).
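For illustration, the EXISTS form that the IN query above roughly corresponds to could be written by hand like this. It reuses the table and column names from the example; the aliases `t1` and `t2` are added here only for readability, and this is a sketch of the equivalent query, not a dump of what the optimizer produces internally.

```sql
-- Correlated EXISTS form of the IN query.
-- For each row of table1, the subquery is a lookup in the index
-- on table2.indexed_column and can stop at the first match.
SELECT  *
FROM    table1 t1
WHERE   EXISTS
        (
        SELECT  1
        FROM    table2 t2
        WHERE   t2.indexed_column = t1.unindexed_column
        );
```

Running EXPLAIN on both forms should show whether the index on `indexed_column` is actually used for the subquery.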

Unfortunately, MySQL is not capable of doing a HASH SEMI JOIN or a MERGE SEMI JOIN, which are even more efficient (especially if both columns are indexed).

193

answered Oct 04 '22 16:10

Quassnoi

Why do you extract the ids first? You should probably just join the tables. If you need the ids for something else, you can insert them into a temporary table first and then use that table for the join.
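A minimal sketch of both suggestions, assuming the first query can be expressed as a plain SELECT that returns unique ids. The `locations` table, its `lat` and `lng` columns, the bounding-box condition, and the name `tmp_matching_ids` are all hypothetical stand-ins for the real geometry query; only `data` and `id` come from the question.

```sql
-- Option 1: inline the id-producing query as a derived table and join to it.
SELECT  d.*
FROM    (
        SELECT  id
        FROM    locations              -- hypothetical source of the ids
        WHERE   lat BETWEEN 51.0 AND 52.0
                AND lng BETWEEN -1.0 AND 0.0
        ) AS ids
JOIN    data d
ON      d.id = ids.id;

-- Option 2: keep the ids in a temporary table so they can be reused.
-- The primary key assumes the first query returns each id only once.
CREATE TEMPORARY TABLE tmp_matching_ids (
    id INT NOT NULL PRIMARY KEY
);

INSERT INTO tmp_matching_ids (id)
SELECT  id
FROM    locations
WHERE   lat BETWEEN 51.0 AND 52.0
        AND lng BETWEEN -1.0 AND 0.0;

SELECT  d.*
FROM    tmp_matching_ids t
JOIN    data d
ON      d.id = t.id;
```

Option 1 avoids the round trip through the application entirely; option 2 is closer to the original two-step approach while keeping the id list on the server.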

answered Oct 04 '22 15:10

Eric Hogue
