Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read huge table with LINQ to SQL: Running out of memory vs slow paging

I have a huge table which I need to read through on a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code that I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around -- yet trying to page them results in horrible performance problems.

Here's the breakdown. Original attempt was this:

var logs = 
    (from record in dataContext.someTable 
     where [index is appropriate]
     select record);

foreach( linqEntity l in logs )
{
    // Do stuff with data from l
}

This is pretty fast, and streams at a good rate, but the problem is that the memory use of the application keeps going up never stops. My guess is that the LINQ to SQL entities are being kept around in memory and not being disposed properly. So after reading Out of memory when creating a lot of objects C# , I tried the following approach. This seems to be the common Skip/Take paradigm that many people use, with the added feature of saving memory.

Note that _conn is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.

int skipAmount = 0;
bool finished = false;

while (!finished)
{
    // Trick to allow for automatic garbage collection while iterating through the DB
    using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
    {               
        var query =
            (from record in tempDataContext.someTable
             where [index is appropriate]
             select record);

        List<workerLog> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
        if (logs.Count == 0)
        {
            finished = true;
            continue;
        }

        foreach( linqEntity l in logs )
        {
            // Do stuff with data from l
        }

        skipAmount += logs.Count;
    }
}

Now I have the desired behavior that memory usage doesn't increase at all as I am streaming through the data. Yet, I have a far worse problem: each Skip is causing the data to load more and more slowly as the underlying query seems to actually cause the server to go through all the data for all previous pages. While running the query each page takes longer and longer to load, and I can tell that this is turning into a quadratic operation. This problem has appeared in the following posts:

  • LINQ Skip() Problem
  • LINQ2SQL select orders and skip/take

I can't seem to find a way to do this with LINQ that allows me to have limited memory use by paging data, and yet still have each page load in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the object in the first approach above, but I can't find out how to do that.

like image 533
Andrew Mao Avatar asked Sep 17 '12 23:09

Andrew Mao


People also ask

Which is faster LINQ or SQL?

Stored procedures are faster as compared to LINQ query since they have a predictable execution plan and can take the full advantage of SQL features. Hence, when a stored procedure is being executed next time, the database used the cached execution plan to execute that stored procedure.

What is the main advantage of using LINQ for database queries?

Advantages of LINQStandardized way of querying multiple data sources: The same LINQ syntax can be used to query multiple data sources. Compile time safety of queries: It provides type checking of objects at compile time. IntelliSense Support: LINQ provides IntelliSense for generic collections.


1 Answers

After madly grasping at some straws, I found that the DataContext's ObjectTrackingEnabled = false could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.

using (var readOnlyDataContext = 
    new MyDataContext(_conn) {CommandTimeout = really_long, ObjectTrackingEnabled = false})
{                                                 
    var logs =
        (from record in readOnlyDataContext.someTable
         where [index is appropriate]
         select record);

    foreach( linqEntity l in logs )
    {
        // Do stuff with data from l   
    }                
}

The above approach does not use any memory when streaming through objects. When writing data, I can use a different DataContext that has object tracking enabled, and that seems to work okay. However, this approach does have the problem of a SQL query that can take an hour or more to stream and complete, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.

A warning about turning object tracking off: I found out that when you try to do multiple concurrent reads with the same DataContext, you don't get the error There is already an open DataReader associated with this Command which must be closed first. The application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a C# bug or a feature.

like image 137
Andrew Mao Avatar answered Sep 30 '22 15:09

Andrew Mao