Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does this LINQ code perform multiple lookups on the original data?

We've got a slight performance issue in a section of code that's using LINQ and it's raised a question about how LINQ works in terms of lookups

My question is this (please note that I have changed all the code so this is an indicative example of the code, not the real scenario):

Given

public class Person {
 int ID;
 string Name;
 DateTime Birthday; 
 int OrganisationID;
}

If I had a list of say 100k Person objects and then a list of dates, say 1000, and I ran this code:

var personBirthdays = from Person p in personList
    where p.OrganisationID = 123
    select p.Birthday;

foreach (DateTime d in dateList)
{
    if (personBirthdays.Contains(d))
        Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}

The resultant code would be an iteration of:

100k (the loop that needs to be done to find the users with the organisationID 123)
multiplied by
1000 (the amount of dates in the list)
multiplied by
x (the amount of users who have the organisationID 123 to be checked against for the date)

This is a lot of iterations!

If I changed the code the personBirthdays to this:

List<DateTime> personBirthdays = 
        (from Person p in personList
        where p.OrganisationID = 123
        select p.Birthday).ToList();

This should remove the 100k as a multiple by, and just do it once?

So you would have 100k + (1000 * x) instead of (100k * 1000 * x).

The question is that this seems too easy, and I'm sure the LINQ is doing something clever somewhere that should mean that this doesn't happen.

If no one answers, I'll run some tests and report back.

Clarity update: We're not considering Database lookups, the personList object is an In Memory list object. This all LINQ-to-Objects.

like image 471
Martin Avatar asked Dec 11 '12 15:12

Martin


2 Answers

This should remove the 10k as a multiple by, and just do it once?

What it means is that instead of iterating personList 100k times, performing the where and select operations for each of those iterations that you'll be iterating the resulting List 100k times, and that the where and select operations will only have been performed on the underlying data source once.

The question is that this seems too easy, and I'm sure the LINQ is doing something clever somewhere that should mean that this doesn't happen.

Nope, your first query is simply something that you shouldn't be doing using LINQ, you should be taking the results of the query and placing them into a data structure if you plan to iterate over them many times (which is what you changed).

You can improve this query even more by using the appropriate data structure. Searching on a List is rather inefficient, as it needs to do a linear search. It would be preferable to use a HashSet to store the results of the query. A HashSet has O(1) search speed in the average case, as opposed to O(n) search time of a List.

var dates = new HashSet<DateTime>(from Person p in personList
                                  where p.OrganisationID = 123
                                  select p.Birthday);

foreach (DateTime d in dateList.Where(date => dates.Contains(date)))
{
    Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}
like image 52
Servy Avatar answered Nov 12 '22 05:11

Servy


This is typical select n+1 problem, and after you applied .ToList() you have partially solved it. Next step could be following: you're constantly iterating over personBirthdays list, replace it with HashSet and you could perform Contains(d) much much faster and remove duplicates:

var personBirthdays = new HashSet<DateTime>((from Person p in personList
    where p.OrganisationID = 123
    select p.Birthday).ToArray());
like image 31
Akim Avatar answered Nov 12 '22 07:11

Akim