Retrieving a large number of records under multiple limitations, without causing an out-of-memory exception

I've got the following situation:

  1. There are two related types. For this question, I'll use the following simple types:

    public class Person
    {
        public Guid Id { get; set; }
        public int Status { get; set; }
    }

    public class Account
    {
        public Guid AccountId { get; set; }
        public decimal Amount { get; set; }
        public Guid PersonId { get; set; }
    }
    

    So that one Person might have multiple Accounts (i.e., multiple Accounts would reference the same PersonId).

  2. In our database, there are tens of thousands of persons, and each has 5-10 accounts on average.

  3. I need to retrieve each person's accounts, provided they fulfill certain requirements. Afterwards, I need to check whether all of that person's accounts, taken together, fulfill another condition.

    In this example, let's say I need every account with an amount < 100, and that after retrieving one person's accounts, I need to check whether their sum is larger than 1000.

  4. Using a LINQ query is desirable, but it can't be done with the group-by-into keywords, because the LINQ provider (LINQ-to-CRM) doesn't support them.

  5. In addition, implementing the requirements of item 3 with the following simple LINQ query is also not possible (please read the inlined comments):

    var query = from p in personList
                join a in accountList on p.Id equals a.PersonId
                where a.Amount < 100
                select a;
    var groups = query.GroupBy(a => a.PersonId);
    // and now, run in batches of x groups
    // (let x be the number of groups that won't cause an out-of-memory exception)
    

    It is not possible for two reasons:

    a. The LINQ provider forces a call to ToList() before using GroupBy().

    b. Actually calling ToList() before using GroupBy() results in an out-of-memory exception, since there are tens of thousands of accounts.

  6. For efficiency reasons, I don't want to do the following, since it means tens of thousands of retrievals (a sketch follows below):

    a. Retrieve all persons.

    b. Loop through them and retrieve each person's accounts on each iteration.
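
    For illustration, this is roughly what I want to avoid (GetAllPersons() and GetAccountsFor() are placeholders for whatever retrieval calls would be used; they only illustrate the one-query-per-person pattern):

    // Placeholder retrieval methods, shown only to illustrate the pattern to avoid:
    // one retrieval for all persons, plus one more retrieval per person for its accounts.
    foreach (var person in GetAllPersons())               // 1 retrieval
    {
        var accounts = GetAccountsFor(person.Id)          // +1 retrieval per person
            .Where(a => a.Amount < 100)
            .ToList();

        if (accounts.Sum(a => a.Amount) > 1000)
        {
            // process this person's qualifying accounts
        }
    }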

I'd be glad for any efficient ideas.

1 Answer

I would suggest ordering the query by PersonId, switching to LINQ to Objects via AsEnumerable() (thus executing it, but without materializing the whole result set in memory the way a ToList() call would), and then using the GroupAdjacent method from the MoreLINQ package:

This method is implemented by using deferred execution and streams the groupings. The grouping elements, however, are buffered. Each grouping is therefore yielded as soon as it is complete and before the next grouping occurs.

var query = from p in personList
            join a in accountList on p.Id equals a.PersonId
            where a.Amount < 100
            orderby a.PersonId
            select a;
var groups = query.AsEnumerable()
    .GroupAdjacent(a => a.PersonId)
    .Where(g => g.Sum(a => a.Amount) > 1000);
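
The resulting sequence can then be consumed one group at a time, so only a single person's accounts are buffered at any moment (a usage sketch; ProcessPersonAccounts is a placeholder for whatever per-person handling is needed):

foreach (var personAccounts in groups)
{
    // personAccounts is one person's accounts with Amount < 100 whose total
    // exceeds 1000; only this group is held in memory while it's processed.
    ProcessPersonAccounts(personAccounts.Key, personAccounts.ToList());
}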

The AsEnumerable() trick works well with the EF query provider for sure. Whether it works with the LINQ to CRM provider really depends on how the provider implements the GetEnumerator() method: if it tries to buffer the whole query result anyway, then you are out of luck.
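
If adding the MoreLINQ package is not an option, a minimal hand-rolled equivalent of GroupAdjacent might look roughly like this (just a sketch: it assumes the incoming sequence is already ordered by PersonId and buffers only one person's accounts at a time):

// Requires: using System; using System.Collections.Generic;
static IEnumerable<List<Account>> GroupAdjacentByPerson(IEnumerable<Account> source)
{
    List<Account> current = null;
    var currentKey = Guid.Empty;

    foreach (var account in source)
    {
        if (current == null || account.PersonId != currentKey)
        {
            if (current != null)
                yield return current;   // previous person's group is complete

            currentKey = account.PersonId;
            current = new List<Account>();
        }
        current.Add(account);
    }

    if (current != null)
        yield return current;           // yield the last group
}

It would be used the same way as GroupAdjacent above:

var groups = GroupAdjacentByPerson(query.AsEnumerable())
    .Where(g => g.Sum(a => a.Amount) > 1000);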
