
Call to IEnumerable.Count() takes multiple seconds

Tags: c#, .net

I have created a small Windows service that should delete all occurrences of a certain file name in certain folders. All of this code runs in the Elapsed handler of a timer (interval = 10 s).

While the service is running, its CPU usage climbs to as much as 20%, so I examined my code, added some trace statements, and found that executing the handler took about 3-4 seconds for no apparent reason.

I narrowed it down to the following piece of code: allReporterFiles.Count(). It calls the Count() method on this IEnumerable, and that call alone takes 3-4 seconds.

My project targets .NET Framework 4.7.2. Is this a framework bug or what?

var files1 = Directory.EnumerateFiles(dirSwReporter, swReporterFileName, SearchOption.AllDirectories);
var files2 = Directory.EnumerateFiles(dirSwReporter2, swReporterFileName, SearchOption.AllDirectories);

var allReporterFiles = files1.Union(files2);

var sw = Stopwatch.StartNew();
var fileCount = allReporterFiles.Count(); // <--- takes ~3.5 seconds
sw.Stop();

Trace.WriteLine($"KillChromeSoftwareReporterTool completed in: {sw.Elapsed.TotalMilliseconds}ms or  {sw.Elapsed.TotalSeconds}sec");
Legends asked Oct 02 '19 20:10


2 Answers

"Is this a framework bug or what?"

It's an issue with your understanding of LINQ's deferred execution, I suspect.

allReporterFiles is just an IEnumerable<string>. Calling Count() means iterating over it - which in turn means the Union code iterating over files1 and files2. I suspect you have an awful lot of files.
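
To make the deferred execution visible without touching the disk, here is a minimal sketch. FakeEnumerateFiles and ItemsProduced are hypothetical names invented for this illustration; the iterator merely stands in for Directory.EnumerateFiles, which is lazy in the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionDemo
{
    // Counts how many items the lazy sources have actually produced so far.
    public static int ItemsProduced;

    // Hypothetical stand-in for Directory.EnumerateFiles: a lazy iterator
    // whose body only runs while somebody is iterating over the result.
    public static IEnumerable<string> FakeEnumerateFiles(string prefix, int count)
    {
        for (var i = 0; i < count; i++)
        {
            ItemsProduced++;
            yield return $"{prefix}{i}.exe";
        }
    }

    public static void Main()
    {
        var files1 = FakeEnumerateFiles("a", 3);
        var files2 = FakeEnumerateFiles("b", 2);
        var all = files1.Union(files2);   // still nothing has been enumerated

        Console.WriteLine(ItemsProduced); // prints 0
        Console.WriteLine(all.Count());   // prints 5 (Count walks both sources)
        Console.WriteLine(ItemsProduced); // prints 5
    }
}
```

With the real Directory.EnumerateFiles, the work done inside that Count() call is the recursive directory walk, which is where the seconds go.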

The way to tell that is to measure how long it takes to iterate over files1 and files2 separately. One easy way to do that is to call ToList(). For example:

// The use of ToList forces the result to be materialized, rather than using deferred
// execution.

var stopwatch = Stopwatch.StartNew();
var files1 = Directory
    .EnumerateFiles(dirSwReporter, swReporterFileName, SearchOption.AllDirectories)
    .ToList();
var files1Time = stopwatch.Elapsed;

stopwatch.Restart();
var files2 = Directory
    .EnumerateFiles(dirSwReporter2, swReporterFileName, SearchOption.AllDirectories)
    .ToList();
var files2Time = stopwatch.Elapsed;

Then log files1Time and files2Time. Now that the content is in two lists, counting the Union won't involve any IO. It will still need to basically create a HashSet<string> as it goes, in order to avoid returning the same value more than once, but it will be much, much quicker.
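
As a rough illustration of that last point, here is a small sketch of counting the Union of two in-memory lists. The paths and the CountDistinct helper are made up for the example; in the real service the lists would come from the ToList() calls above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UnionOnListsDemo
{
    // Both inputs are already materialized, so this performs no IO.
    // Union yields each distinct value once, using the default string
    // comparer (ordinal, so paths differing only in case count as distinct).
    public static int CountDistinct(List<string> files1, List<string> files2)
    {
        return files1.Union(files2).Count();
    }

    public static void Main()
    {
        // Hypothetical paths, with one entry present in both lists.
        var files1 = new List<string> { @"C:\a\tool.exe", @"C:\b\tool.exe" };
        var files2 = new List<string> { @"C:\b\tool.exe", @"C:\c\tool.exe" };

        Console.WriteLine(CountDistinct(files1, files2)); // prints 3
    }
}
```

If the two search roots can never contain the same path, Concat would give the same count without the duplicate tracking, but that is an assumption about your folder layout.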

This approach won't be any faster overall - and will use more memory - but it'll make it obvious whether most of the time is in searching in dirSwReporter or dirSwReporter2, which may be enough to help you optimize.

Jon Skeet answered Sep 30 '22 01:09


The information about deferred execution is buried in the Remarks section of the .NET Framework 4.7 documentation for the Directory.EnumerateFiles method.

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.

The part about efficiency is irrelevant in your context, since you're calling Count on the result, which requires a full enumeration.


By the way, the .NET Framework 4.8 documentation for the Directory.EnumerateFiles method states:

The returned collection is not cached; each call to the GetEnumerator on the collection will start a new enumeration.
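
A small sketch of what "not cached" means in practice, using a throwaway temp directory (the directory and the file names one.txt and two.txt are arbitrary choices for the demo). Every Count() on the lazy sequence walks the disk again, whereas a ToList() snapshot is walked once and then reused.

```csharp
using System;
using System.IO;
using System.Linq;

public static class NotCachedDemo
{
    // Returns { snapshot count, lazy count } after a second file is added.
    public static int[] Run()
    {
        var dir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(dir);
        try
        {
            File.WriteAllText(Path.Combine(dir, "one.txt"), "");

            var lazy = Directory.EnumerateFiles(dir, "*.txt", SearchOption.AllDirectories);
            var snapshot = lazy.ToList(); // walks the directory once, keeps the result

            File.WriteAllText(Path.Combine(dir, "two.txt"), "");

            // snapshot is a plain list; lazy.Count() starts a fresh enumeration.
            return new[] { snapshot.Count, lazy.Count() };
        }
        finally
        {
            Directory.Delete(dir, recursive: true);
        }
    }

    public static void Main()
    {
        var counts = Run();
        Console.WriteLine(counts[0]); // prints 1
        Console.WriteLine(counts[1]); // prints 2
    }
}
```

So if the handler needs both the count and the file names, materializing once and reusing the list avoids paying for the directory walk twice.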

tymtam answered Sep 30 '22 01:09