Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the fastest way to search a List<T> across multiple properties?

Tags:

c#

linq

I have a process I've inherited that I'm converting to C# from another language. Numerous steps in the process loop through what can be a lot of records (100K-200K) to do calculations. As part of those processes it generally does a lookup into another list to retrieve some values. I would normally move this kind of thing into a SQL statement (and we have where we've been able to) but in these cases there isn't really an easy way to do that. In some places we've attempted to convert the code to a stored procedure and decided it wasn't working nearly as well as we had hoped.

Effectively, the code does this:

var match = cost.Where(r => r.ryp.StartsWith(record.form.TrimEnd()) && 
                       r.year == record.year && 
                       r.period == record.period).FirstOrDefault();

cost is a local List type. If I was doing a search on only one field I'd probably just move this into a Dictionary. The records aren't always unique either.

Obviously, this is REALLY slow.

I ran across the open source library I4O which can build indexes, however it fails for me in various queries (and I don't really have the time to attempt to debug the source code). It also doesn't work with .StartsWith or .Contains (StartsWith is much more important since a lot of the original queries take advantage of the fact that doing a search for "A" would find a match in "ABC").

Are there any other projects (open source or commercial) that do this sort of thing?

EDIT:

I did some searching based on the feedback and found Power Collections which supports dictionaries that have keys that aren't unique.

I tested ToLookup() which worked great - it's still not quite as fast as the original code, but it's at least acceptable. It's down from 45 seconds to 3-4 seconds. I'll take a look at the Trie structure for the other look ups.

Thanks.

like image 453
Paul Mrozowski Avatar asked Apr 11 '12 16:04

Paul Mrozowski


2 Answers

Looping through a list of 100K-200K items doesn't take very long. Finding matching items within the list by using nested loops (n^2) does take long. I infer this is what you're doing (since you have assignment to a local match variable).

If you want to quickly match items together, use .ToLookup.

var lookup = cost.ToLookup(r => new {r.year, r.period, form = r.ryp});

foreach(var group in lookup)
{
  // do something with items in group.
}

Your startswith criteria is troublesome for key-based matching. One way to approach that problem is to ignore it when generating keys.

var lookup = cost.ToLookup(r => new {r.year, r.period });
var key = new {record.year, record.period};
string lookForThis = record.form.TrimEnd();
var match = lookup[key].FirstOrDefault(r => r.ryp.StartsWith(lookForThis))

Ideally, you would create the lookup once and reuse it for many queries. Even if you didn't... even if you created the lookup each time, it will still be faster than n^2.

like image 83
Amy B Avatar answered Oct 12 '22 23:10

Amy B


Certainly you can do better than this. Let's start by considering that dictionaries are not useful only when you want to query one field; you can very easily have a dictionary where the key is an immutable value that aggregates many fields. So for this particular query, an immediate improvement would be to create a key type:

// should be immutable, GetHashCode and Equals should be implemented, etc etc
struct Key
{
    public int year;
    public int period;
}

and then package your data into an IDictionary<Key, ICollection<T>> or similar where T is the type of your current list. This way you can cut down heavily on the number of rows considered in each iteration.

The next step would be to use not an ICollection<T> as the value type but a trie (this looks promising), which is a data structure tailored to finding strings that have a specified prefix.

Finally, a free micro-optimization would be to take the TrimEnd out of the loop.

Now certainly all of this only applies to the specific example given and may need to be revisited due to other specifics of your situation, but in any case you should be able to extract practical gain from this or something similar.

like image 31
Jon Avatar answered Oct 12 '22 23:10

Jon