 

.NET: fastest way to find all files matching a pattern in all directories

Tags:

vb.net

I have ~500K *.ax5 files that I must process and export to another format. Because of the large number of files, and because Windows performs poorly with too many files in one folder, they are buried in subfolders along with other files of different extensions. In C#, what is the fastest way to find every file, at any level of subfolder, under say C:\Sketch?

The folder structure is always the same: AAAA\BB\CCCC_BLD[a bunch of different file types]. After the initial run, I'd also like to process only the files with a write date greater than the last run date.
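
For the incremental pass, something like this is what I have in mind; LoadLastRunDate is just a placeholder for wherever I end up persisting the timestamp:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Sketch of the incremental filter; LoadLastRunDate is a placeholder helper.
DateTime lastRunDate = LoadLastRunDate();
IEnumerable<FileInfo> changed = new DirectoryInfo(@"C:\Sketch")
    .EnumerateFiles("*.ax5", SearchOption.AllDirectories)
    .Where(f => f.LastWriteTime > lastRunDate);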

Alternatively, how can I quickly get the count of files found, so I can display the percentage processed?

I cannot change the source structure of the files/folders; that is set by the vendor.

Here's what I have. I've tried both Array.ForEach and Parallel.ForEach; both seem very slow.

' Recursively walks the tree, exporting every file that matches pattern.
Sub walkTree(ByVal directory As DirectoryInfo, ByVal pattern As String)
    ' Export each matching file in the current folder...
    Array.ForEach(directory.EnumerateFiles(pattern).ToArray(), Sub(fileInfo)
                                                                   Export(fileInfo)
                                                               End Sub)
    ' ...then recurse into each subfolder.
    For Each subDir In directory.EnumerateDirectories()
        walkTree(subDir, pattern)
    Next
End Sub
asked Jan 22 '14 by Doug Chamberlain

1 Answer

http://msdn.microsoft.com/en-us/library/ms143316(v=vs.110).aspx

Directory.GetFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories);

Might be good enough for you?
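
If you also need a total up front for a progress percentage, Directory.EnumerateFiles (available since .NET 4) streams matches lazily, and materializing the result once gives you the count to divide by. A rough sketch, reusing your Export method:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

List<string> files = Directory.EnumerateFiles(@"C:\Sketch", "*.ax5", SearchOption.AllDirectories).ToList();

int done = 0;
foreach (string path in files)
{
    Export(new FileInfo(path));  // your existing Export method
    done++;
    Console.WriteLine("{0:P0} processed", (double)done / files.Count);
}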


As for performance, I doubt you will find a much faster way to scan directories, since, as @Mathew Foscarini points out, your disks are the bottleneck here.

If the directory is indexed, it would be faster to use the index, as @jaccus mentions.
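
For completeness, here is a sketch of querying the Windows Search index over OLE DB. This assumes the Windows Search service is running and that C:\Sketch is inside an indexed scope:

using System;
using System.Data.OleDb;

// Assumption: C:\Sketch is covered by the Windows Search index.
using (var conn = new OleDbConnection("Provider=Search.CollatorDSO;Extended Properties='Application=Windows';"))
{
    conn.Open();
    var cmd = new OleDbCommand(
        "SELECT System.ItemPathDisplay FROM SystemIndex " +
        "WHERE scope='file:C:/Sketch' AND System.FileExtension='.ax5'", conn);
    using (OleDbDataReader reader = cmd.ExecuteReader())
    {
        while (reader.Read())
            Console.WriteLine(reader.GetString(0));
    }
}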


I took the time to benchmark things a little, and it does actually seem like you're able to get about a 33% performance gain by collecting files asynchronously.

The test set I ran on might not match your situation; I don't know how deeply nested your files are, etc. What I did was create 5000 random files in each directory on every level (I settled for a single level, though) and 100 directories, amounting to 505,000 files...
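
If you want to reproduce that kind of layout, here is a sketch of a generator along those lines (rootDir is a placeholder path; 5000 files in the root plus 100 subdirectories of 5000 each gives the 505,000 total):

using System.IO;

string rootDir = @"C:\Benchmark";  // placeholder
for (int d = 0; d <= 100; d++)
{
    string dir = d == 0 ? rootDir : Path.Combine(rootDir, "dir" + d);
    Directory.CreateDirectory(dir);
    for (int f = 0; f < 5000; f++)
        File.WriteAllText(Path.Combine(dir, f + ".ax5"), "test");
}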

I tested 3 methods of collecting files...

The simplest approach.

using System.Collections.Generic;
using System.IO;

public class SimpleFileCollector
{
    // One framework call walks the whole tree.
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        return new List<string>(Directory.GetFiles(directory.FullName, pattern, SearchOption.AllDirectories));
    }
}

The "Dumb" approach, although this is only dumb if you know of the overload used in the Simple approach... Otherwise this is a perfectly fine solution.

using System.Collections.Generic;
using System.IO;
using System.Linq;

public class DumbFileCollector
{
    // Walks the tree by hand: one GetFiles call per directory, recursing into subdirectories.
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        List<string> files = new List<string>(500000);
        files.AddRange(directory.GetFiles(pattern).Select(file => file.FullName));

        foreach (DirectoryInfo dir in directory.GetDirectories())
        {
            files.AddRange(CollectFiles(dir, pattern));
        }
        return files;
    }
}

The Task API Approach...

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

public class ThreadedFileCollector
{
    public List<string> CollectFiles(DirectoryInfo directory, string pattern)
    {
        ConcurrentQueue<string> queue = new ConcurrentQueue<string>();
        InternalCollectFiles(directory, pattern, queue);
        return queue.ToList();
    }

    private void InternalCollectFiles(DirectoryInfo directory, string pattern, ConcurrentQueue<string> queue)
    {
        // Collect this directory's matches on the current thread...
        foreach (string result in directory.GetFiles(pattern).Select(file => file.FullName))
        {
            queue.Enqueue(result);
        }

        // ...and fan out one task per subdirectory, waiting for all of them to finish.
        Task.WaitAll(directory
            .GetDirectories()
            .Select(dir => Task.Factory.StartNew(() => InternalCollectFiles(dir, pattern, queue)))
            .ToArray());
    }
}

This is only a test of collecting all the files, not processing them; the processing is what would make sense to kick off to threads.
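
For the processing step itself, something along these lines would be the natural follow-on (Export being the asker's method from the question):

using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

List<string> files = new ThreadedFileCollector()
    .CollectFiles(new DirectoryInfo(@"C:\Sketch"), "*.ax5");

// Let Parallel.ForEach size the worker pool for the heavy Export calls.
Parallel.ForEach(files, path => Export(new FileInfo(path)));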

Here are the results on my system:

Simple Collector:
 - Pass 0: found 505000 files in 2847 ms
 - Pass 1: found 505000 files in 2865 ms
 - Pass 2: found 505000 files in 2860 ms
 - Pass 3: found 505000 files in 3061 ms
 - Pass 4: found 505000 files in 3006 ms
 - Pass 5: found 505000 files in 2807 ms
 - Pass 6: found 505000 files in 2849 ms
 - Pass 7: found 505000 files in 2789 ms
 - Pass 8: found 505000 files in 2790 ms
 - Pass 9: found 505000 files in 2788 ms
Average: 2866 ms

Dumb Collector:
 - Pass 0: found 505000 files in 5190 ms
 - Pass 1: found 505000 files in 5204 ms
 - Pass 2: found 505000 files in 5453 ms
 - Pass 3: found 505000 files in 5311 ms
 - Pass 4: found 505000 files in 5339 ms
 - Pass 5: found 505000 files in 5362 ms
 - Pass 6: found 505000 files in 5316 ms
 - Pass 7: found 505000 files in 5319 ms
 - Pass 8: found 505000 files in 5583 ms
 - Pass 9: found 505000 files in 5197 ms
Average: 5327 ms

Threaded Collector:
 - Pass 0: found 505000 files in 2152 ms
 - Pass 1: found 505000 files in 2102 ms
 - Pass 2: found 505000 files in 2022 ms
 - Pass 3: found 505000 files in 2030 ms
 - Pass 4: found 505000 files in 2075 ms
 - Pass 5: found 505000 files in 2120 ms
 - Pass 6: found 505000 files in 2030 ms
 - Pass 7: found 505000 files in 1980 ms
 - Pass 8: found 505000 files in 1993 ms
 - Pass 9: found 505000 files in 2120 ms
Average: 2062 ms
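
In case you want to reproduce the numbers, each pass boils down to a Stopwatch loop like this (a sketch, not the exact harness):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

var collector = new ThreadedFileCollector();  // or either of the other two
var sw = new Stopwatch();
for (int pass = 0; pass < 10; pass++)
{
    sw.Restart();
    List<string> files = collector.CollectFiles(new DirectoryInfo(@"C:\Sketch"), "*.ax5");
    sw.Stop();
    Console.WriteLine(" - Pass {0}: found {1} files in {2} ms", pass, files.Count, sw.ElapsedMilliseconds);
}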

As a side note, @Konrad Kokosa suggested blocking on each directory to make sure you don't kick off millions of threads. Don't do that...

There is no reason for you to manage how many threads will be active at a given time; let the Task framework's standard scheduler handle that. It will do a much better job of balancing the number of threads against the number of cores you have...

And if you really do want to control it yourself, implementing a custom scheduler would be the better option: http://msdn.microsoft.com/en-us/library/system.threading.tasks.taskscheduler(v=vs.110).aspx
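
That said, since .NET 4.5 you can also get a concurrency-limited scheduler without writing one, via ConcurrentExclusiveSchedulerPair. A minimal sketch:

using System;
using System.Threading.Tasks;

// Caps the number of concurrently executing tasks at 4 without a hand-rolled scheduler.
var pair = new ConcurrentExclusiveSchedulerPair(TaskScheduler.Default, maxConcurrencyLevel: 4);
var factory = new TaskFactory(pair.ConcurrentScheduler);

Task t = factory.StartNew(() => Console.WriteLine("runs on the capped scheduler"));
t.Wait();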

answered Sep 29 '22 by Jens