I have this query:
var newComponents = from ic in importedComponents
where !existingComponents.Contains(ic)
select ic;
importedComponents
and existingComponents
are of type List<ImportedComponent>
, and exist only in memory (are not tied to a data context). In this instance, importedComponents
has just over 6,100 items, and existingComponents
has 511 items.
This statement is taking too long to complete (I don't know how long, I stop the script after 20 minutes). I've tried the following with no improvement in execution speed:
var existingComponentIDs = from ec in existingComponents
select ec.ID;
var newComponents = from ic in importedComponents
where !existingComponentIDs.Contains(ic.ID)
select ic;
Any help will be much appreciated.
The problem is quadratic complexity of this algorithm. Put the IDs of all existingComponentIDs into a HashSet and use the HashSet.Contains method. It has O(1) lookup cost compared to O(N) for Contains/Any on a list.
The morelinq project contains a method that does all of that in one convenient step: ExceptBy.
You could use Except
to get the set difference:
var existingComponentIDs = existingComponents.Select(c => c.ID);
var importedComponentIDs = importedComponents.Select(c => c.ID);
var newComponentIDs = importedComponentIDs.Except(existingComponentIDs);
var newComponents = from ic in importedComponents
join newID in newComponentIDs on ic.ID equals newID
select ic;
foreach (var c in newComponents)
{
// insert into database?
}
Why is LINQ JOIN so much faster than linking with WHERE?
In short: Join method can set up a hash table to use as an index to quicky zip two tables together
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With