
Optimizing LINQ routines

I run a build system. Data-wise, the simplified description is that I have Configurations, and each configuration has 0..n Builds. Builds produce artifacts, and some of these are stored on the server. What I am doing is writing a kind of rule that sums all the bytes produced by each configuration's builds and checks whether the total is too large.

The current code for the routine is as follows:

private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations)
{
    var sw = new Stopwatch();
    sw.Start();
    // Let's take only configurations that have been updated within the last 7 days
    var items = allConfigurations.AsParallel().Where(x =>
        x.artifact_cleanup_type != null && x.build_cleanup_type != null &&
        x.updated_date > DateTime.UtcNow.AddDays(-7)
        ).ToList();

    using (var ctx = new LocalEntities())
    {
        Debug.WriteLine("Context: " + sw.Elapsed);
        var allBuilds = ctx.Builds;
        var ruleResult = new List<Notification>();
        foreach (var configuration in items)
        {
            // all builds for current configuration
            var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
                .OrderByDescending(z => z.build_date);
            Debug.WriteLine("Filter conf builds: " + sw.Elapsed);

            // Since I don't know which builds/artifacts have been cleaned up, calculate it manually
            if (configuration.build_cleanup_count != null)
            {
                var buildCleanupCount = "30"; // default
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays"))
                {
                    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount));
                    configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate)
                            .OrderByDescending(z => z.build_date);
                }
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount"))
                {
                    var buildLastCleanupCount = int.Parse(buildCleanupCount);
                    configurationBuilds =
                        configurationBuilds.Take(buildLastCleanupCount).OrderByDescending(z => z.build_date);
                }
            }

            if (configuration.artifact_cleanup_count != null)
            {
                // skipped, similar to previous block
            }

            Debug.WriteLine("Done cleanup: " + sw.Elapsed);
            const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
            // Sum all disc usage per configuration
            var confDiscSizePerConfiguration = configurationBuilds
                .GroupBy(c => new {c.configuration_id})
                .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
                .Select(groupedBuilds =>
                    new
                    {
                        configurationId = groupedBuilds.FirstOrDefault().configuration_id,
                        configurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                        Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                        Average = groupedBuilds.Average(c => c.artifact_dir_size)
                    }).ToList();
            Debug.WriteLine("Done db query: " + sw.Elapsed);

            ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
            {
                ConfigurationId = iter.configurationId,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int) RulesEnum.TooMuchDisc,
                ConfigrationPath = iter.configurationPath
            }));
            Debug.WriteLine("Finished loop: " + sw.Elapsed);
        }
        // find owners and insert...
    }
}

This does exactly what I want, but I am wondering if I could make it any faster. Currently I see:

Context: 00:00:00.0609067
// first round
Filter conf builds: 00:00:00.0636291
Done cleanup: 00:00:00.0644505
Done db query: 00:00:00.3050122
Finished loop: 00:00:00.3062711
// avg round
Filter conf builds: 00:00:00.0001707
Done cleanup: 00:00:00.0006343
Done db query: 00:00:00.0760567
Finished loop: 00:00:00.0773370

The SQL generated by .ToList() looks very messy. (Everything used in the WHERE clause is covered by an index in the DB.)

I am testing with 200 configurations, so this adds up to 00:00:18.6326722. I have a total of ~8k items that need to be processed daily (so the whole routine takes more than 10 minutes to complete).

I have been googling around, and it seems that Entity Framework is not very good with parallel processing. Knowing that, I still decided to give the async/await approach a try (it's the first time I've tried it, so sorry for any nonsense).

Basically, if I move all the processing out into a separate method, like:

foreach (var configuration in items)
{
    var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);

    ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
    {
        // ... skipped
    }));
}

And:

private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds)
{
    var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
        .OrderByDescending(z => z.build_date);
    // ..skipped
    var confDiscSizePerConfiguration = configurationBuilds
        .GroupBy(c => new {c.configuration_id})
        .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
        .Select(groupedBuilds =>
            new Tmp
            {
                ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id,
                ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                Average = groupedBuilds.Average(c => c.artifact_dir_size)
            }).ToListAsync();
    return await confDiscSizePerConfiguration;
}

This, for some reason, drops the execution time for 200 items from 18 to 13 seconds. Anyway, from what I understand, since I am awaiting each .ToListAsync(), it is still processed in sequence. Is that correct?

The "can't process in parallel" claim starts to show up when I replace foreach (var configuration in items) with Parallel.ForEach(items, async configuration =>. Making this change results in:

A second operation started on this context before a previous asynchronous operation completed. Use 'await' to ensure that any asynchronous operations have completed before calling another method on this context. Any instance members are not guaranteed to be thread safe.

It was a bit confusing to me at first, as I await practically everywhere the compiler allows it, but possibly the data gets seeded too fast.

I tried to overcome this by being less greedy and adding new ParallelOptions { MaxDegreeOfParallelism = 4 } to that parallel loop. My naive assumption was that since the default connection pool size is 100 and all I want to use is 4, that should be plenty. But it still fails.

I have also tried creating new DbContexts inside the GetData method, but it still fails. If I remember correctly (I can't test right now), I got:

Underlying connection failed to open

What possibilities are there to make this routine go faster?

Asked Oct 19 '22 by Erki M.

1 Answer

Before going parallel, it is worth optimizing the query itself. Here are some suggestions that might improve your times:

1) Use the group Key when working with GroupBy. This might solve the issue of the complex, nested SQL query, as it instructs LINQ to use the keys defined in GROUP BY instead of creating a sub-select.

        var confDiscSizePerConfiguration = configurationBuilds
            .GroupBy(c => new { ConfigurationId = c.configuration_id, ConfigurationPath = c.configuration_path})
            .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
            .Select(groupedBuilds =>
                new
                {
                    configurationId = groupedBuilds.Key.ConfigurationId,
                    configurationPath = groupedBuilds.Key.ConfigurationPath,
                    Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                    Average = groupedBuilds.Average(c => c.artifact_dir_size)
                })
            .ToList();

2) It seems that you are bitten by the N+1 problem. In simple words: you execute one SQL query to get all configurations and N more to get build information. In total that is ~8k small queries where 2 bigger queries would suffice. If memory usage is not a constraint, fetch all build data into memory and optimize for fast lookup using ToLookup.

var allBuilds = ctx.Builds.ToLookup(x=>x.configuration_id);

Later you can look up builds by:

var configurationBuilds = allBuilds[configuration.configuration_id].OrderByDescending(z => z.build_date);

3) You are doing OrderBy on configurationBuilds multiple times. Filtering does not affect record order, so you can safely remove extra calls to OrderBy:

...
configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate);
...
configurationBuilds = configurationBuilds.Take(buildLastCleanupCount);
...

4) There is no point in doing GroupBy, as the builds are already filtered down to a single configuration.
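Putting 3) and 4) together, the body of your loop could be sketched like this (a sketch only, reusing the variable names from your code; the nullable cast is an assumption that guards against summing an empty result set):

```csharp
// Builds are already filtered to a single configuration, so a plain Sum suffices
// and GroupBy can be dropped entirely.
// Casting to long? avoids an exception when no builds match the filters.
var total = configurationBuilds.Sum(c => (long?)c.artifact_dir_size) ?? 0;
if (total > maxDiscAllocationPerConfiguration)
{
    ruleResult.Add(new Notification
    {
        ConfigurationId = configuration.configuration_id,
        CreatedDate = DateTime.UtcNow,
        RuleType = (int)RulesEnum.TooMuchDisc,
        ConfigrationPath = configuration.configuration_path
    });
}
```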

UPDATE:

I took it one step further and created code that retrieves the same results as your code with a single request. It should be more performant and use less memory.

private void CalculateExtendedDiskUsage()
{
    const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
    const int buildCleanupCount = 30; // default from your code
    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-buildCleanupCount);

    using (var ctx = new LocalEntities())
    {
        var ruleResult = ctx.Configurations
            .Where(x => x.build_cleanup_count != null && 
                (
                    (x.build_cleanup_type == "ReserveBuildsByDays" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).Where(y => y.build_date > buildLastCleanupDate).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) ||
                    (x.build_cleanup_type == "ReserveBuildsByCount" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).OrderByDescending(y => y.build_date).Take(buildCleanupCount).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                )
            )
            .Select(x => new Notification
            {
                ConfigurationId = x.configuration_id,
                ConfigrationPath = x.configuration_path,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int)RulesEnum.TooMuchDisc,
            })
            .ToList();
    }
}
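For completeness, regarding the parallel error itself: it means a single DbContext was shared between concurrent operations, and a context is not thread-safe, so each concurrent task needs its own. A rough sketch of how that could look, assuming the calling method is made async and GetData queries through the context it is handed:

```csharp
// One DbContext per task; Task.WhenAll runs the queries concurrently.
var tasks = items.Select(async configuration =>
{
    using (var ctx = new LocalEntities())
    {
        // GetData awaits ToListAsync internally, so the query completes
        // before the context is disposed.
        return await GetData(configuration, ctx.Builds);
    }
});
var results = await Task.WhenAll(tasks);
foreach (var confDiscSizePerConfiguration in results)
{
    ruleResult.AddRange(confDiscSizePerConfiguration
        .Select(iter => new Notification { /* ... as before ... */ }));
}
```

Note that Parallel.ForEach with an async lambda does not await the tasks it starts (the delegate becomes async void), which is why awaiting everywhere inside the loop did not prevent the error.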
Answered Nov 03 '22 by Kaspars Ozols