Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is WHERE ID IN (1, 2, 3, 4, 5, ...) the most efficient?

I know that this topic has been beaten to death, but it seems that many articles on the Internet are often looking for the most elegant way instead of the most efficient way how to solve it. Here is the problem. We are building an application where one of the common database querys will involve manipulation (SELECT’s and UPDATE’s) based on a user supplied list of ID’s. The table in question is expected to have hundreds of thousands of rows, and the user provided lists of ID’s can potentially unbounded, bust they will be most likely in terms of tens or hundreds (we may limit it for performance reasons later).

If my understanding of how databases work in general is correct, the most efficient is to simply use the WHERE ID IN (1, 2, 3, 4, 5, ...) construct and build queries dynamically. The core of the problem is the input lists of ID’s will be really arbitrary, and so no matter how clever the database is or how cleverly we implement it, we always have an random subset of integers to start with and so eventually every approach has to internally boil down to something like WHERE ID IN (1, 2, 3, 4, 5, ...) anyway.

One can find many approaches all over the web. For instance, one involves declaring a table variable, passing the list of ID’s to a store procedure as a comma delimited string, splitting it in the store procedure, inserting the ID’s into the table variable and joining the master table on it, i.e. something like this:

-- 1. Temporary table for ID’s:
DECLARE @IDS TABLE (ID int);

-- 2. Split the given string of ID’s, and each ID to @IDS.
-- Omitted for brevity.

-- 3. Join the main table to @ID’s:
SELECT MyTable.ID, MyTable.SomeColumn
FROM MyTable INNER JOIN @IDS ON MyTable.ID = @IDS.ID;

Putting the problems with string manipulation aside, I think what essentially happens in this case is that in the third step the SQL Server says: “Thank you, that’s nice, but I just need a list of the ID’s”, and it scans the table variable @IDS, and then does n seeks in MyTable where n is the number of the ID’s. I’ve done some elementary performance evaluations and inspected the query plan, and it seems that this is what happens. So the table variable, the string concatenation and splitting and all the extra INSERT’s are for nothing.

Am I correct? Or am I missing anything? Is there really some clever and more efficient way? Basically, what I’m saying is that the SQL Server has to do n index seeks no matter what and formulating the query as WHERE ID IN (1, 2, 3, 4, 5, ...) is the most straightforward way to ask for it.

like image 696
Jan Zich Avatar asked Oct 05 '09 20:10

Jan Zich


2 Answers

Well, it depends on what's really going on. How is the user choosing these IDs?

Also, it's not just efficiency; there's also security and correctness to worry about. When and how does the user tell the database about their ID choices? How do you incorporate them into the query?

It might be much better to put the selected IDs into a separate table that you can join against (or use a WHERE EXISTS against).

I'll give you that you're not likely to do much better performance-wise than IN (1,2,3..n) for a small (user-generated) n. But you need to think about how you generate that query. Are you going to use dynamic SQL? If so, how will you secure it from injection? Will the server be able to cache the execution plan?

Also, using an extra table is often just easier. Say you're building a shopping cart for an eCommerce site. Rather than worrying up keeping track of the cart client side or in a session, it's likely better to update the ShoppingCart table every time the user makes a selection. This also avoids the whole problem of how to safely set the parameter value for your query, because you're only making one change at a time.

Don't forget to old adage (with apologies to Benjamin Franklin):

He who would trade correctness for performance deserves neither

like image 194
Joel Coehoorn Avatar answered Oct 21 '22 07:10

Joel Coehoorn


Be careful; on many databases, IN (...) is limited to a fixed number of things in the IN clause. For example, I think it's 1000 in Oracle. That's big, but possibly worth knowing.

like image 41
Dean J Avatar answered Oct 21 '22 07:10

Dean J