I run a build system. In simplified terms the data model is: I have Configurations, and each Configuration has 0..n Builds. Builds produce artifacts, and some of these are stored on the server. What I am writing is a kind of rule that sums all the bytes produced by each configuration's builds and checks whether the total is too large.
The current code for the routine is the following:
private void CalculateExtendedDiskUsage(IEnumerable<Configuration> allConfigurations)
{
    var sw = new Stopwatch();
    sw.Start();

    // Lets take only confs that have been updated within last 7 days
    var items = allConfigurations.AsParallel().Where(x =>
        x.artifact_cleanup_type != null && x.build_cleanup_type != null &&
        x.updated_date > DateTime.UtcNow.AddDays(-7)
    ).ToList();

    using (var ctx = new LocalEntities())
    {
        Debug.WriteLine("Context: " + sw.Elapsed);
        var allBuilds = ctx.Builds;
        var ruleResult = new List<Notification>();
        foreach (var configuration in items)
        {
            // all builds for current configuration
            var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
                .OrderByDescending(z => z.build_date);
            Debug.WriteLine("Filter conf builds: " + sw.Elapsed);

            // Since I don't know which builds/artifacts have been cleaned up, calculate it manually
            if (configuration.build_cleanup_count != null)
            {
                var buildCleanupCount = "30"; // default
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByDays"))
                {
                    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-int.Parse(buildCleanupCount));
                    configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate)
                        .OrderByDescending(z => z.build_date);
                }
                if (configuration.build_cleanup_type.Equals("ReserveBuildsByCount"))
                {
                    var buildLastCleanupCount = int.Parse(buildCleanupCount);
                    configurationBuilds =
                        configurationBuilds.Take(buildLastCleanupCount).OrderByDescending(z => z.build_date);
                }
            }

            if (configuration.artifact_cleanup_count != null)
            {
                // skipped, similar to previous block
            }
            Debug.WriteLine("Done cleanup: " + sw.Elapsed);

            const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
            // Sum all disc usage per configuration
            var confDiscSizePerConfiguration = configurationBuilds
                .GroupBy(c => new { c.configuration_id })
                .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
                .Select(groupedBuilds =>
                    new
                    {
                        configurationId = groupedBuilds.FirstOrDefault().configuration_id,
                        configurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                        Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                        Average = groupedBuilds.Average(c => c.artifact_dir_size)
                    }).ToList();
            Debug.WriteLine("Done db query: " + sw.Elapsed);

            ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
            {
                ConfigurationId = iter.configurationId,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int) RulesEnum.TooMuchDisc,
                ConfigrationPath = iter.configurationPath
            }));
            Debug.WriteLine("Finished loop: " + sw.Elapsed);
        }
        // find owners and insert...
    }
}
This does exactly what I want, but I am wondering if I could make it any faster. Currently I see:
Context: 00:00:00.0609067
// first round
Filter conf builds: 00:00:00.0636291
Done cleanup: 00:00:00.0644505
Done db query: 00:00:00.3050122
Finished loop: 00:00:00.3062711
// avg round
Filter conf builds: 00:00:00.0001707
Done cleanup: 00:00:00.0006343
Done db query: 00:00:00.0760567
Finished loop: 00:00:00.0773370
The SQL generated by .ToList() looks very messy. (Everything that is used in WHERE is covered by an index in the DB.)
I am testing with 200 configurations, so this adds up to 00:00:18.6326722. I have a total of ~8k items that need to get processed daily (so the whole routine takes more than 10 minutes to complete).
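(That is roughly 0.077 s per configuration on the average round, so ~8000 × 0.077 s ≈ 620 s, i.e. a bit over 10 minutes.)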
I have been randomly googling around the internet, and it seems to me that Entity Framework is not very good with parallel processing. Knowing that, I still decided to give the async/await approach a try (first time I tried it, so sorry for any nonsense).
Basically if I move all the processing out of scope like:
foreach (var configuration in items)
{
    var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);

    ruleResult.AddRange(confDiscSizePerConfiguration.Select(iter => new Notification
    {
        // ... skipped
    }
And:
private async Task<List<Tmp>> GetData(Configuration configuration, IQueryable<Build> allBuilds)
{
    var configurationBuilds = allBuilds.Where(x => x.configuration_id == configuration.configuration_id)
        .OrderByDescending(z => z.build_date);
    //..skipped

    var confDiscSizePerConfiguration = configurationBuilds
        .GroupBy(c => new { c.configuration_id })
        .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
        .Select(groupedBuilds =>
            new Tmp
            {
                ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id,
                ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                Average = groupedBuilds.Average(c => c.artifact_dir_size)
            }).ToListAsync();

    return await confDiscSizePerConfiguration;
}
This, for some reason, drops the execution time for 200 items from 18 -> 13 sec. Anyway, from what I understand, since I am awaiting each .ToListAsync(), it is still processed in sequence, is that correct?
So the "can't process in parallel" claim starts coming out when I replace the foreach (var configuration in items)
with Parallel.ForEach(items, async configuration =>
. Doing this change results in:
A second operation started on this context before a previous asynchronous operation completed. Use 'await' to ensure that any asynchronous operations have completed before calling another method on this context. Any instance members are not guaranteed to be thread safe.
It was a bit confusing to me at first, as I await practically everywhere the compiler allows it, but possibly the data gets seeded too fast. I tried to overcome this by being less greedy and added new ParallelOptions { MaxDegreeOfParallelism = 4 } to that parallel loop; my naive assumption was that the default connection pool size is 100, and all I want to use is 4, so that should be plenty. But it still fails.
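Roughly, the parallel version of the loop looked like this (a sketch; the Notification handling is the same as in the sequential version and skipped here):
// Sketch of the failing attempt: the async lambda and the shared context/allBuilds as above
Parallel.ForEach(items, new ParallelOptions { MaxDegreeOfParallelism = 4 }, async configuration =>
{
    var confDiscSizePerConfiguration = await GetData(configuration, allBuilds);
    // ... build Notifications as in the sequential version, skipped
});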
I have also tried to create new DbContexts inside the GetData method, but it still fails. If I remember correctly (I can't test right now), I got:
Underlying connection failed to open
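The per-call context version of GetData was roughly this (again only a sketch; maxDiscAllocationPerConfiguration is assumed to be a constant in scope):
// Sketch of the "new DbContext per call" attempt; the actual code may have differed.
private async Task<List<Tmp>> GetData(Configuration configuration)
{
    using (var ctx = new LocalEntities())
    {
        var configurationBuilds = ctx.Builds
            .Where(x => x.configuration_id == configuration.configuration_id)
            .OrderByDescending(z => z.build_date);
        //..cleanup filtering skipped, same as before
        return await configurationBuilds
            .GroupBy(c => new { c.configuration_id })
            .Where(c => c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration)
            .Select(groupedBuilds => new Tmp
            {
                ConfigurationId = groupedBuilds.FirstOrDefault().configuration_id,
                ConfigurationPath = groupedBuilds.FirstOrDefault().configuration_path,
                Total = groupedBuilds.Sum(c => c.artifact_dir_size),
                Average = groupedBuilds.Average(c => c.artifact_dir_size)
            })
            .ToListAsync();
    }
}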
What possibilities are there to make this routine go faster?
Before going parallel, it is worth optimizing the query itself. Here are some suggestions that might improve your times:
1) Use Key when working with GroupBy. This might solve the issue of the complex, nested SQL query, as that way you instruct LINQ to use the same keys defined in GROUP BY and not to create a sub-select.
var confDiscSizePerConfiguration = configurationBuilds
    .GroupBy(c => new { ConfigurationId = c.configuration_id, ConfigurationPath = c.configuration_path })
    .Where(c => (c.Sum(z => z.artifact_dir_size) > maxDiscAllocationPerConfiguration))
    .Select(groupedBuilds =>
        new
        {
            configurationId = groupedBuilds.Key.ConfigurationId,
            configurationPath = groupedBuilds.Key.ConfigurationPath,
            Total = groupedBuilds.Sum(c => c.artifact_dir_size),
            Average = groupedBuilds.Average(c => c.artifact_dir_size)
        })
    .ToList();
2) It seems that you are bitten by the N+1 problem. In simple words: you execute one SQL query to get all configurations and then N more to get the build information. In total that would be ~8k small queries, where 2 bigger queries would suffice. If memory usage is not a constraint, fetch all build data into memory and optimize for fast lookup using ToLookup.
var allBuilds = ctx.Builds.ToLookup(x => x.configuration_id);
Later you can look up builds by:
var configurationBuilds = allBuilds[configuration.configuration_id].OrderByDescending(z => z.build_date);
3) You are doing OrderBy on configurationBuilds multiple times. Filtering does not affect record order, so you can safely remove the extra calls to OrderBy:
...
configurationBuilds = configurationBuilds.Where(x => x.build_date > buildLastCleanupDate);
...
configurationBuilds = configurationBuilds.Take(buildLastCleanupCount);
...
4) There is no point in doing GroupBy, as the builds are already filtered for a single configuration.
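For example, the loop body could be reduced to something like this (a sketch reusing the names from your code; maxDiscAllocationPerConfiguration is your 1 GB constant):
// Sketch: builds are already filtered to one configuration, so a plain Sum replaces the GroupBy.
var total = configurationBuilds.Sum(c => c.artifact_dir_size);
if (total > maxDiscAllocationPerConfiguration)
{
    ruleResult.Add(new Notification
    {
        ConfigurationId = configuration.configuration_id,
        ConfigrationPath = configuration.configuration_path,
        CreatedDate = DateTime.UtcNow,
        RuleType = (int)RulesEnum.TooMuchDisc
    });
}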
UPDATE:
I took it one step further and wrote code that retrieves the same results as your code with a single request. It should be more performant and use less memory.
private void CalculateExtendedDiskUsage()
{
    // Assumed values, taken from the defaults in the original code (1 GB limit, "30" cleanup default)
    const int maxDiscAllocationPerConfiguration = 1000000000; // 1GB
    const int buildCleanupCount = 30;
    var buildLastCleanupDate = DateTime.UtcNow.AddDays(-buildCleanupCount);

    using (var ctx = new LocalEntities())
    {
        var ruleResult = ctx.Configurations
            .Where(x => x.build_cleanup_count != null &&
                (
                    (x.build_cleanup_type == "ReserveBuildsByDays" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).Where(y => y.build_date > buildLastCleanupDate).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration) ||
                    (x.build_cleanup_type == "ReserveBuildsByCount" && ctx.Builds.Where(y => y.configuration_id == x.configuration_id).OrderByDescending(y => y.build_date).Take(buildCleanupCount).Sum(y => y.artifact_dir_size) > maxDiscAllocationPerConfiguration)
                )
            )
            .Select(x => new Notification
            {
                ConfigurationId = x.configuration_id,
                ConfigrationPath = x.configuration_path,
                CreatedDate = DateTime.UtcNow,
                RuleType = (int)RulesEnum.TooMuchDisc,
            })
            .ToList();
    }
}