Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Improve the performance for enumerating files and folders using .NET

Tags:

I have a base directory that contains several thousand folders. Inside of these folders there can be between 1 and 20 subfolders that contains between 1 and 10 files. I'd like to delete all files that are over 60 days old. I was using the code below to get the list of files that I would have to delete:

DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
  dirInfo.GetFiles("*.*", SearchOption.AllDirectories)
    .Where(t=>t.CreationTime < DateTime.Now.AddDays(-60)).ToArray();

But I let this run for about 30 minutes and it still hasn't finished. I'm curious if anyone can see anyway that I could potentially improve the performance of the above line or if there is a different way I should be approaching this entirely for better performance? Suggestions?

like image 515
Abe Miessler Avatar asked Jul 19 '13 21:07

Abe Miessler


People also ask

How do I enumerate files and folders?

To enumerate directories and files, use methods that return an enumerable collection of directory or file names, or their DirectoryInfo, FileInfo, or FileSystemInfo objects. If you want to search and return only the names of directories or files, use the enumeration methods of the Directory class.

What is EnumerateFiles C#?

EnumerateFiles(String, String, EnumerationOptions) Returns an enumerable collection of full file names that match a search pattern and enumeration options in a specified path, and optionally searches subdirectories.

What is file enumeration?

File/parameter enumeration is a common technique used to search for suspicious files and parameter values in order to detect their existence or validity. Using this technique, it is possible to map additional parts of the application, which are not normally exposed to the public.


2 Answers

This is (probably) as good as it's going to get:

DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
    dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
           .AsParallel()
           .Where(fi => fi.CreationTime < sixtyLess).ToArray();

Changes:

  • Made the the 60 days less DateTime constant, and therefore less CPU load.
  • Used EnumerateFiles.
  • Made the query parallel.

Should run in a smaller amount of time (not sure how much smaller).

Here is another solution which might be faster or slower than the first, it depends on the data:

DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
     dirInfo.EnumerateDirectories()
            .AsParallel()
            .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
                                .Where(fi => fi.CreationTime < sixtyLess))
            .ToArray();

Here it moves the parallelism to the main folder enumeration. Most of the changes from above apply too.

like image 119
It'sNotALie. Avatar answered Sep 19 '22 00:09

It'sNotALie.


A possibly faster alternative is to use WINAPI FindNextFile. There is an excellent Faster Directory Enumeration Tool for this. Which can be used as follows:

HashSet<FileData> GetPast60(string dir)
{
    DateTime retval = DateTime.Now.AddDays(-60);
    HashSet<FileData> oldFiles = new HashSet<FileData>();

    FileData [] files = FastDirectoryEnumerator.GetFiles(dir);
    for (int i=0; i<files.Length; i++)
    {
        if (files[i].LastWriteTime < retval)
        {
            oldFiles.Add(files[i]);
        }
    }    
    return oldFiles;
}

EDIT

So, based on comments below, I decided to do a benchmark of suggested solutions here as well as others I could think of. It was interesting enough to see that EnumerateFiles seemed to out-perform FindNextFile in C#, while EnumerateFiles with AsParallel was by far the fastest followed surprisingly by command prompt count. However do note that AsParallel wasn't getting the complete file count or was missing some files counted by the others so you could say the command prompt method is the best.

Applicable Config:

  • Windows 7 Service Pack 1 x64
  • Intel(R) Core(TM) i5-3210M CPU @2.50GHz 2.50GHz
  • RAM: 6GB
  • Platform Target: x64
  • No Optimization (NB: Compiling with optimization will produce drastically poor performance)
  • Allow UnSafe Code
  • Start Without Debugging

Below are three screenshots:

Run 1

Run 2

Run 3

I have included my test code below:

static void Main(string[] args)
{
    Console.Title = "File Enumeration Performance Comparison";
    Stopwatch watch = new Stopwatch();
    watch.Start();

    var allfiles = GetPast60("C:\\Users\\UserName\\Documents");
    watch.Stop();
    Console.WriteLine("Total time to enumerate using WINAPI =" + watch.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles);

    Stopwatch watch1 = new Stopwatch();
    watch1.Start();

    var allfiles1 = GetPast60Enum("C:\\Users\\UserName\\Documents\\");
    watch1.Stop();
    Console.WriteLine("Total time to enumerate using EnumerateFiles =" + watch1.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles1);

    Stopwatch watch2 = new Stopwatch();
    watch2.Start();

    var allfiles2 = Get1("C:\\Users\\UserName\\Documents\\");
    watch2.Stop();
    Console.WriteLine("Total time to enumerate using Get1 =" + watch2.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles2);


    Stopwatch watch3 = new Stopwatch();
    watch3.Start();

    var allfiles3 = Get2("C:\\Users\\UserName\\Documents\\");
    watch3.Stop();
    Console.WriteLine("Total time to enumerate using Get2 =" + watch3.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles3);

    Stopwatch watch4 = new Stopwatch();
    watch4.Start();

    var allfiles4 = RunCommand(@"dir /a: /b /s C:\Users\UserName\Documents");
    watch4.Stop();
    Console.WriteLine("Total time to enumerate using Command Prompt =" + watch4.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles4);


    Console.WriteLine("Press Any Key to Continue...");
    Console.ReadLine();
}

private static int RunCommand(string command)
{
    var process = new Process()
    {
        StartInfo = new ProcessStartInfo("cmd")
        {
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            Arguments = String.Format("/c \"{0}\"", command),
        }
    };
    int count = 0;
    process.OutputDataReceived += delegate { count++; };
    process.Start();
    process.BeginOutputReadLine();

    process.WaitForExit();
    return count;
}

static int GetPast60Enum(string dir)
{
    return new DirectoryInfo(dir).EnumerateFiles("*.*", SearchOption.AllDirectories).Count();
}

private static int Get2(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
               .AsParallel().Count();
}

private static int Get1(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateDirectories()
               .AsParallel()
               .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
               .Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}


private static int GetPast60(string dir)
{
    return FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories).Length;
}

NB: I concentrated on count in the benchmark not modified date.

like image 26
Chibueze Opata Avatar answered Sep 21 '22 00:09

Chibueze Opata