I have created a small Windows service that should delete all occurrences of a certain file name in certain folders.
All of this code runs in the Elapsed handler of a timer (interval = 10 s).
While the service is running I see it using up to 20% CPU, so I examined my code, added some trace statements and found that executing the handler took about 3-4 seconds for no apparent reason.
I narrowed it down to the following piece of code: allReporterFiles.Count(). It's calling the Count() method of this IEnumerable, and this call takes 3-4 seconds.
My project is set up for .NET 4.7.2. Is this a framework bug or what?
var files1 = Directory.EnumerateFiles(dirSwReporter, swReporterFileName, SearchOption.AllDirectories);
var files2 = Directory.EnumerateFiles(dirSwReporter2, swReporterFileName, SearchOption.AllDirectories);
var allReporterFiles = files1.Union(files2);
var sw = Stopwatch.StartNew();
var fileCount = allReporterFiles.Count(); // <--- takes ~3.5 seconds
sw.Stop();
Trace.WriteLine($"KillChromeSoftwareReporterTool completed in: {sw.Elapsed.TotalMilliseconds}ms or {sw.Elapsed.TotalSeconds}sec");
Is this a framework bug or what?
It's an issue with your understanding of LINQ's deferred execution, I suspect.
allReporterFiles is just an IEnumerable<string>. Calling Count() means iterating over it - which in turn means the Union code iterating over files1 and files2. I suspect you have an awful lot of files.
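To make the deferred execution visible without touching the disk, here is a minimal sketch. SlowSource and Produced are hypothetical stand-ins for Directory.EnumerateFiles and its IO cost, not anything from the original code; the point is only that nothing runs until Count() walks the sequence.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class DeferredDemo
{
    // Hypothetical counter: how many items the source has actually produced.
    public static int Produced;

    // Hypothetical stand-in for Directory.EnumerateFiles. The body of an
    // iterator method does not run until something enumerates the result.
    public static IEnumerable<string> SlowSource(string prefix, int n)
    {
        for (int i = 0; i < n; i++)
        {
            Produced++;                // the "expensive" work happens here
            yield return prefix + i;
        }
    }

    public static void Main()
    {
        var a = SlowSource("a", 3);
        var b = SlowSource("b", 2);
        var all = a.Union(b);          // still nothing has been enumerated

        Console.WriteLine(Produced);   // prints 0
        int count = all.Count();       // this call drives both sources
        Console.WriteLine(count);      // prints 5
        Console.WriteLine(Produced);   // prints 5
    }
}
```

With the real Directory.EnumerateFiles, the work done inside the loop is the recursive directory walk, which is why the time shows up at the Count() call rather than at the lines that build the query.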
The way to tell that is to measure how long it takes to iterate over files1 and files2 separately. One easy way to do that is to call ToList(). For example:
// The use of ToList forces the result to be materialized, rather than using deferred
// execution.
var stopwatch = Stopwatch.StartNew();
var files1 = Directory
.EnumerateFiles(dirSwReporter, swReporterFileName, SearchOption.AllDirectories)
.ToList();
var files1Time = stopwatch.Elapsed;
stopwatch.Restart();
var files2 = Directory
.EnumerateFiles(dirSwReporter2, swReporterFileName, SearchOption.AllDirectories)
.ToList();
var files2Time = stopwatch.Elapsed;
Then log files1Time and files2Time. Now that the content is in two lists, counting the Union won't involve any IO. It will still need to basically create a HashSet<string> as it goes, in order to avoid returning the same value more than once, but it will be much, much quicker.
This approach won't be any faster overall - and will use more memory - but it'll make it obvious whether most of the time is in searching in dirSwReporter or dirSwReporter2, which may be enough to help you optimize.
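As a rough illustration of the HashSet<string> point above, counting a union over two in-memory lists amounts to something like the following. This is a conceptual sketch only, not the framework's actual Union implementation, and CountDistinct is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class UnionCountSketch
{
    // Hypothetical helper: counts the distinct strings across both inputs,
    // roughly what first.Union(second).Count() has to do.
    public static int CountDistinct(IEnumerable<string> first, IEnumerable<string> second)
    {
        var seen = new HashSet<string>();
        int count = 0;

        // HashSet<T>.Add returns false for a value that is already present,
        // so duplicates are skipped.
        foreach (var s in first)
        {
            if (seen.Add(s)) count++;
        }
        foreach (var s in second)
        {
            if (seen.Add(s)) count++;
        }
        return count;
    }

    public static void Main()
    {
        var list1 = new List<string> { "a", "b" };
        var list2 = new List<string> { "b", "c" };

        Console.WriteLine(CountDistinct(list1, list2));   // prints 3
        Console.WriteLine(list1.Union(list2).Count());    // prints 3
    }
}
```

Both loops only touch memory, so once the lists exist the cost is proportional to the number of paths rather than to the size of the directory tree.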
The information about deferred execution is buried in the Remarks section of the .NET Framework 4.7 documentation for the Directory.EnumerateFiles method:
The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
The part about efficiency is obviously irrelevant in your context, since you're calling Count on the result, which requires full enumeration.
By the way, the .NET Framework 4.8 documentation for Directory.EnumerateFiles states:
The returned collection is not cached; each call to the GetEnumerator on the collection will start a new enumeration.
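One practical consequence, assuming the service goes on to delete the files it has just counted: calling Count() and then looping over the same query would walk the directory tree twice. A minimal sketch of enumerating once and reusing the result follows; DeleteMatches, the temp folder and the file names are hypothetical, chosen only so the example runs on its own.

```csharp
using System;
using System.IO;
using System.Linq;

static class EnumerateOnce
{
    // Hypothetical helper: deletes every file called fileName under root
    // and returns how many were deleted.
    public static int DeleteMatches(string root, string fileName)
    {
        // ToList walks the tree exactly once and keeps the paths in memory.
        var matches = Directory
            .EnumerateFiles(root, fileName, SearchOption.AllDirectories)
            .ToList();

        foreach (var path in matches)
        {
            File.Delete(path);
        }

        // List<T>.Count is a property, so this causes no further IO.
        return matches.Count;
    }

    public static void Main()
    {
        // Build a throwaway directory tree for the demonstration.
        string root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(Path.Combine(root, "sub"));
        File.WriteAllText(Path.Combine(root, "target.txt"), "");
        File.WriteAllText(Path.Combine(root, "sub", "target.txt"), "");
        File.WriteAllText(Path.Combine(root, "sub", "keep.txt"), "");

        Console.WriteLine(DeleteMatches(root, "target.txt"));   // prints 2

        Directory.Delete(root, true);
    }
}
```

This does not make the single walk any cheaper; it only avoids paying for it more than once per timer tick.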