Optimizing LINQ: IN operator with large data

How to optimize this query?

// This will return data ranging from 1 to 500,000 records
List<string> products = GetProductsNames(); 


List<Product> actualProducts = (from p in db.Products
                               where products.Contains(p.Name)
                               select p).ToList();

This code takes around 30 seconds to fill actualProducts when I pass a list of 44,000 strings; I don't know how long it would take for 500,000 records. :(

Is there any way to tweak this query?

NOTE: It takes almost this long on every call (ignoring the slow first EDMX call).

UsmanAzam asked May 15 '26 05:05



2 Answers

An IN query on 500,000 records is always going to be a pathological case.

Firstly, make sure there is an index (probably non-clustered) on Name in the database.

Ideas (both involve dropping to ADO.NET):

  • use a "table valued parameter" to pass in the values, and INNER JOIN to the table-valued-parameter in TSQL
  • alternatively, create a table of the form ProductQuery with columns QueryId (which could be uniqueidentifier) and Name; invent a guid to represent your query (Guid.NewGuid()), and then use SqlBulkCopy to push the 500,000 pairs (the same guid on each row; different guids are different queries) into the table really quickly; then use TSQL to do an INNER JOIN between the two tables

Actually, these are very similar, but the first one is probably the first thing to try. Less to set up.
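The table-valued-parameter idea could be sketched roughly as follows. This is only an illustration under assumptions that are not in the question: a user-defined table type called dbo.NameList with a single Name column already exists on the server, the table is dbo.Products, and the class and method names are invented for the example. It uses System.Data.SqlClient, which is built into .NET Framework and is a separate package on newer runtimes.

```csharp
using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;

public static class ProductLookup
{
    // Builds a single-column DataTable whose shape matches the hypothetical
    // table type: CREATE TYPE dbo.NameList AS TABLE (Name nvarchar(200) NOT NULL)
    public static DataTable BuildNameTable(IEnumerable<string> names)
    {
        var table = new DataTable();
        table.Columns.Add("Name", typeof(string));
        foreach (var name in names)
        {
            table.Rows.Add(name);
        }
        return table;
    }

    // Sends all names in one round trip and lets SQL Server do the join.
    // dbo.Products and dbo.NameList are assumed names, not taken from the question.
    public static List<string> FindExistingNames(string connectionString, IEnumerable<string> names)
    {
        const string sql = @"
            SELECT p.Name
            FROM dbo.Products AS p
            INNER JOIN @names AS n ON n.Name = p.Name;";

        var found = new List<string>();
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, conn))
        {
            // A structured parameter carries the DataTable as a table-valued parameter.
            var param = cmd.Parameters.Add("@names", SqlDbType.Structured);
            param.TypeName = "dbo.NameList";
            param.Value = BuildNameTable(names);

            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    found.Add(reader.GetString(0));
                }
            }
        }
        return found;
    }
}
```

To get full Product rows back, select the remaining columns in the same query and map them in the reader loop; the join itself stays the same.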

Marc Gravell answered May 19 '26 04:05



If you don't want to do this in the database, you could try something with Dictionary<string, string>.

If I'm not wrong, products.Contains(p.Name) is expensive because List<T>.Contains is an O(n) operation. Try changing the return type of GetProductsNames to Dictionary<string, string>, or convert the List to a Dictionary:

Dictionary<string, string> productsDict = products.ToDictionary(x => x);

Now that you have a dictionary in hand, rewrite the query as follows:

List<Product> actualProducts = (from p in db.Products
                           where productsDict.ContainsKey(p.Name)
                           select p).ToList();

This should improve performance a lot. The disadvantage is that you allocate roughly double the memory; the advantage is speed. I tested with very large samples with good results. Try it out.
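For reference, here is a minimal sketch of the same hashed-lookup idea using a HashSet<string>. It assumes the product rows are already in memory: a database LINQ provider such as LINQ to Entities may not be able to translate ContainsKey into SQL, so the hashed lookup only pays off for LINQ to Objects. The Product class and the FilterByName helper below are stand-ins invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for the real entity class; only Name matters here.
public class Product
{
    public string Name { get; set; }
}

public static class InMemoryFilter
{
    // Filters rows that are already in memory. HashSet lookups are O(1) on
    // average, compared with O(n) for List<T>.Contains.
    public static List<Product> FilterByName(IEnumerable<Product> rows, IEnumerable<string> names)
    {
        // Unlike ToDictionary, the HashSet constructor tolerates duplicate names.
        var wanted = new HashSet<string>(names, StringComparer.Ordinal);
        return rows.Where(p => wanted.Contains(p.Name)).ToList();
    }
}
```

Calling it as InMemoryFilter.FilterByName(db.Products.AsEnumerable(), products) would work, but it pulls the whole Products table into memory first, so it is only worth considering when that table is small enough to load.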

Hope this helps.

Sriram Sakthivel answered May 19 '26 02:05



