I have a huge table which I need to read through on a certain order and compute some aggregate statistics. The table already has a clustered index for the correct order so getting the records themselves is pretty fast. I'm trying to use LINQ to SQL to simplify the code that I need to write. The problem is that I don't want to load all the objects into memory, since the DataContext seems to keep them around -- yet trying to page them results in horrible performance problems.
Here's the breakdown. Original attempt was this:
var logs =
(from record in dataContext.someTable
where [index is appropriate]
select record);
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
This is pretty fast, and streams at a good rate, but the problem is that the memory use of the application keeps going up never stops. My guess is that the LINQ to SQL entities are being kept around in memory and not being disposed properly. So after reading Out of memory when creating a lot of objects C# , I tried the following approach. This seems to be the common Skip
/Take
paradigm that many people use, with the added feature of saving memory.
Note that _conn
is created beforehand, and a temporary data context is created for each query, resulting in the associated entities being garbage collected.
int skipAmount = 0;
bool finished = false;
while (!finished)
{
// Trick to allow for automatic garbage collection while iterating through the DB
using (var tempDataContext = new MyDataContext(_conn) {CommandTimeout = 600})
{
var query =
(from record in tempDataContext.someTable
where [index is appropriate]
select record);
List<workerLog> logs = query.Skip(skipAmount).Take(BatchSize).ToList();
if (logs.Count == 0)
{
finished = true;
continue;
}
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
skipAmount += logs.Count;
}
}
Now I have the desired behavior that memory usage doesn't increase at all as I am streaming through the data. Yet, I have a far worse problem: each Skip
is causing the data to load more and more slowly as the underlying query seems to actually cause the server to go through all the data for all previous pages. While running the query each page takes longer and longer to load, and I can tell that this is turning into a quadratic operation. This problem has appeared in the following posts:
I can't seem to find a way to do this with LINQ that allows me to have limited memory use by paging data, and yet still have each page load in constant time. Is there a way to do this properly? My hunch is that there might be some way to tell the DataContext to explicitly forget about the object in the first approach above, but I can't find out how to do that.
Stored procedures are faster as compared to LINQ query since they have a predictable execution plan and can take the full advantage of SQL features. Hence, when a stored procedure is being executed next time, the database used the cached execution plan to execute that stored procedure.
Advantages of LINQStandardized way of querying multiple data sources: The same LINQ syntax can be used to query multiple data sources. Compile time safety of queries: It provides type checking of objects at compile time. IntelliSense Support: LINQ provides IntelliSense for generic collections.
After madly grasping at some straws, I found that the DataContext
's ObjectTrackingEnabled = false
could be just what the doctor ordered. It is, not surprisingly, specifically designed for a read-only case like this.
using (var readOnlyDataContext =
new MyDataContext(_conn) {CommandTimeout = really_long, ObjectTrackingEnabled = false})
{
var logs =
(from record in readOnlyDataContext.someTable
where [index is appropriate]
select record);
foreach( linqEntity l in logs )
{
// Do stuff with data from l
}
}
The above approach does not use any memory when streaming through objects. When writing data, I can use a different DataContext
that has object tracking enabled, and that seems to work okay. However, this approach does have the problem of a SQL query that can take an hour or more to stream and complete, so if there's a way to do the paging as above without the performance hit, I'm open to other alternatives.
A warning about turning object tracking off: I found out that when you try to do multiple concurrent reads with the same DataContext
, you don't get the error There is already an open DataReader associated with this Command which must be closed first.
The application just goes into an infinite loop with 100% CPU usage. I'm not sure if this is a C# bug or a feature.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With