Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a faster way than this to find all the files in a directory and all sub directories?

I'm writing a program that needs to search a directory and all its sub directories for files that have a certain extension. This is going to be used both on a local, and a network drive, so performance is a bit of an issue.

Here's the recursive method I'm using now:

private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
{
    DirectoryInfo di = new DirectoryInfo(rootFolderPath);

    FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    files.AddRange(fiArr);

    DirectoryInfo[] diArr = di.GetDirectories();

    foreach (DirectoryInfo info in diArr)
    {
        GetFileList(fileSearchPattern, info.FullName, files);
    }
}

I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.

While I'm creating a list of FileInfo objects now all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see what files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?


Update 1

I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:

public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);

    List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
    dirList.Add(rootDir);

    List<FileInfo> fileList = new List<FileInfo>();

    foreach (DirectoryInfo dir in dirList)
    {
        fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
    }

    return fileList;
}

Update 2

Alright so I've run some tests on a local and a remote folder both of which have a lot of files (~1200). Here are the methods I've run the tests on. The results are below.

  • GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
  • GetFileListB(): Recursive method from the original question
  • GetFileListC(): Gets all the directories with static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
  • GetFileListD(): Marc Gravell's solution using a queue and returns IEnumberable. I populated a List with the resulting IEnumerable
    • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
  • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory using using SearchOption.AllDirectories
Method                       Local Folder       Remote Folder
GetFileListA()               00:00.0781235      05:22.9000502
GetFileListB()               00:00.0624988      03:43.5425829
GetFileListC()               00:00.0624988      05:19.7282361
GetFileListD()               00:00.0468741      03:38.1208120
DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
Directory.GetFiles           00:00.0312494      03:48.0737459

. . .so looks like Marc's is the fastest.

like image 661
Eric Anastas Avatar asked Jan 21 '10 04:01

Eric Anastas


People also ask

How do you get a list of all files in a directory?

Start -> Run -> Type in “cmd” This will open the command window. Next I will have to move into the correct directory. On my computer, the default directory is on the C: drive, but the folder I want to list the files for is on the D: drive, the first thing I will see is the prompt “C:\>”.

How do I get a list of folders in a folder?

You can use the DIR command by itself (just type “dir” at the Command Prompt) to list the files and folders in the current directory.


11 Answers

Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}

Note also that 4.0 has inbuilt iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; less arrays)

like image 78
Marc Gravell Avatar answered Oct 07 '22 05:10

Marc Gravell


Cool question.

I played around a little and by leveraging iterator blocks and LINQ I appear to have improved your revised implementation by about 40%

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);
    var dirList = rootDir.GetDirectories("*", SearchOption.AllDirectories);

    return from directoriesWithFiles in ReturnFiles(dirList, searchPattern).SelectMany(files => files)
           select directoriesWithFiles;
}

private static IEnumerable<FileInfo[]> ReturnFiles(DirectoryInfo[] dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}
like image 24
Brad Cunningham Avatar answered Oct 07 '22 05:10

Brad Cunningham


The short answer of how to improve the performance of that code is: You cant.

The real performance hit your experiencing is the actual latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is of course excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now)

Given the original constraints there are several solutions already posted that more or less elegantly wrap the iteration process (However, since I assume that I'm reading from a single hard-drive, parallelism will NOT help to more quickly transverse a directory tree, and may even increase that time since you now have two or more threads fighting for data on different parts of the drive as it attempts to seek back and fourth) reduce the number of objects created, etc. However if we evaluate the how the function will be consumed by the end developer there are some optimizations and generalizations that we can come up with.

First, we can delay the execution of the performance by returning an IEnumerable, yield return accomplishes this by compiling in a state machine enumerator inside of an anonymous class that implements IEnumerable and gets returned when the method executes. Most methods in LINQ are written to delay execution until the iteration is performed, so the code in a select or SelectMany will not be performed until the IEnumerable is iterated through. The end result of delayed execution is only felt if you need to take a subset of the data at a later time, for instance, if you only need the first 10 results, delaying the execution of a query that returns several thousand results won't iterate through the entire 1000 results until you need more than ten.

Now, given that you want to do a subfolder search, I can also infer that it may be useful if you can specify that depth, and if I do this it also generalizes my problem, but also necessitates a recursive solution. Then, later, when someone decides that it now needs to search two directories deep because we increased the number of files and decided to add another layer of categorization you can simply make a slight modification instead of re-writing the function.

In light of all that, here is the solution I came up with that provides a more general solution than some of the others above:

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}

On a side note, something else that hasn't been mentioned by anyone so far is file permissions and security. Currently, there's no checking, handling, or permissions requests, and the code will throw file permission exceptions if it encounters a directory it doesn't have access to iterate through.

like image 37
Paul Rohde Avatar answered Oct 07 '22 06:10

Paul Rohde


This takes 30 seconds to get 2 million file names that meet the filter. The reason this is so fast is because I am only performing 1 enumeration. Each additional enumeration affects performance. The variable length is open to your interpretation and not necessarily related to the enumeration example.

if (Directory.Exists(path))
{
    files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
    .Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
    .Select(s => s.Remove(0, length)).ToList(); // Remove the Dir info.
}
like image 29
Kentonbmax Avatar answered Oct 07 '22 04:10

Kentonbmax


I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.

This solution does not return FileInfo objects, but could be modified to do so—or potentially only return the relevant data needed from a custom FileInfo object.

This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.

P/Invoke:

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
    );

Method:

public static Tuple<long, long> CountFilesDirectories(
    string path,
    CancellationToken token
    )
{
    if (String.IsNullOrWhiteSpace(path))
        throw new ArgumentNullException("path", "The provided path is NULL or empty.");

    // If the provided path doesn't end in a backslash, append one.
    if (path.Last() != '\\')
        path += '\\';

    IntPtr hFile = IntPtr.Zero;
    Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();

    long files = 0;
    long dirs = 0;

    try
    {
        hFile = Win32.Kernel32.FindFirstFile(
            path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
            ref fd
            );

        // If we encounter an error, or there are no files/directories, we return no entries.
        if (hFile.ToInt64() == -1)
            return Tuple.Create<long, long>(0, 0);

        //
        // Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
        //

        List<string> directories = new List<string>();

        do
        {
            // If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
            // count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
            if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
                (fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
                fd.cFileName != "." && fd.cFileName != "..")
            {
                directories.Add(System.IO.Path.Combine(path, fd.cFileName));
                dirs++;
            }
            // Otherwise, if this is a file ("archive"), increment the file count.
            else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
            {
                files++;
            }
        }
        while (Win32.Kernel32.FindNextFile(hFile, ref fd));

        // Iterate over each discovered directory in parallel to maximize file/directory counting performance,
        // calling itself recursively to traverse each directory completely.
        Parallel.ForEach(
            directories,
            new ParallelOptions()
            {
                CancellationToken = token
            },
            directory =>
            {
                var count = CountFilesDirectories(
                    directory,
                    token
                    );

                lock (directories)
                {
                    files += count.Item1;
                    dirs += count.Item2;
                }
            });
    }
    catch (Exception)
    {
        // Handle as desired.
    }
    finally
    {
        if (hFile.ToInt64() != 0)
            Win32.Kernel32.FindClose(hFile);
    }

    return Tuple.Create<long, long>(files, dirs);
}

On my local system, the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster—45 minutes vs. 90 seconds to access a remote directory of ~40k files, ~40 GB in size.

This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.

like image 38
Justin Shidell Avatar answered Oct 07 '22 04:10

Justin Shidell


The BCL methods are portable so to speak. If staying 100% managed I believe the best you can do is calling GetDirectories/Folders while checking access rights (or possibly not checking the rights and having another thread ready to go when the first one takes a little too long - a sign that it's about to throw UnauthorizedAccess exception - this might be avoidable with exception filters using VB or as of today unreleased c#).

If you want faster than GetDirectories you have to call win32 (findsomethingEx etc) which provides specific flags that allow ignore possibly unnecessary IO while traversing the MFT structures. Also if the drive is a network share, there can be a great speedup by similar approach but this time avoiding also excessive network roundtrips.

Now if you have admin and use ntfs and are in a real hurry with millions of files to go through, the absolute fastest way to go through them (assuming spinning rust where the disk latency kills) is use of both mft and journaling in combination, essentially replacing the indexing service with one that's targeted for your specific need. If you only need to find filenames and not sizes (or sizes too but then you must cache them and use journal to notice changes), this approach could allow for practically instant search of tens of millions of files and folders if implemented ideally. There may be one or two paywares that have bothered with this. There's samples of both MFT (DiscUtils) and journal reading (google) in C# around. I only have about 5 million files and just using NTFSSearch is good enough for that amount as it takes about 10-20 seconds to search them. With journal reading added it would go down to <3 seconds for that amount.

like image 23
Anonymous Coward Avatar answered Oct 07 '22 05:10

Anonymous Coward


DirectoryInfo seems to give much more information than you need, try piping a dir command and parsing the info from that.

like image 22
user2385360 Avatar answered Oct 07 '22 05:10

user2385360


Consider splitting the updated method into two iterators:

private static IEnumerable<DirectoryInfo> GetDirs(string rootFolderPath)
{
     DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
     yield return rootDir;

     foreach(DirectoryInfo di in rootDir.GetDirectories("*", SearchOption.AllDirectories));
     {
          yield return di;
     }
     yield break;
}

public static IEnumerable<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
{
     var allDirs = GetDirs(rootFolderPath);
     foreach(DirectoryInfo di in allDirs())
     {
          var files = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
          foreach(FileInfo fi in files)
          {
               yield return fi;
          }
     }
     yield break;
}

Also, further to the network-specific scenario, if you were able to install a small service on that server that you could call into from a client machine, you'd get much closer to your "local folder" results, because the search could execute on the server and just return the results to you. This would be your biggest speed boost in the network folder scenario, but may not be available in your situation. I've been using a file synchronization program that includes this option -- once I installed the service on my server the program became WAY faster at identifying the files that were new, deleted, and out-of-sync.

like image 24
Jay Avatar answered Oct 07 '22 06:10

Jay


Try Parallel programming:

private string _fileSearchPattern;
private List<string> _files;
private object lockThis = new object();

public List<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    _fileSearchPattern = fileSearchPattern;
    AddFileList(rootFolderPath);
    return _files;
}

private void AddFileList(string rootFolderPath)
{
    var files = Directory.GetFiles(rootFolderPath, _fileSearchPattern);
    lock (lockThis)
    {
        _files.AddRange(files);
    }

    var directories = Directory.GetDirectories(rootFolderPath);

    Parallel.ForEach(directories, AddFileList); // same as Parallel.ForEach(directories, directory => AddFileList(directory));
}
like image 22
Jaider Avatar answered Oct 07 '22 06:10

Jaider


I had the same problem. Here is my attempt which is a lot faster than calling Directory.EnumerateFiles, Directory.EnumerateDirectories or Directory.EnumerateFileSystemEntries recursive:

public static IEnumerable<string> EnumerateDirectoriesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<string> EnumerateFilesRecursive(string directoryPath)
{
    return EnumerateFileSystemEntries(directoryPath).Where(e => !e.isDirectory).Select(e => e.EntryPath);
}

public static IEnumerable<(string EntryPath, bool isDirectory)> EnumerateFileSystemEntries(string directoryPath)
{
    Stack<string> directoryStack = new Stack<string>(new[] { directoryPath });

    while (directoryStack.Any())
    {
        foreach (string fileSystemEntry in Directory.EnumerateFileSystemEntries(directoryStack.Pop()))
        {
            bool isDirectory = (File.GetAttributes(fileSystemEntry) & (FileAttributes.Directory | FileAttributes.ReparsePoint)) == FileAttributes.Directory;

            yield return (fileSystemEntry, isDirectory);

            if (isDirectory)
                directoryStack.Push(fileSystemEntry);
        }
    }
}

You can modify the code to search for specific files or directories easily.

Regards

like image 29
Scordo Avatar answered Oct 07 '22 04:10

Scordo


I needed to get all files from my C: partition so i combined Marc and Jaider answers and got function with no recursion and with parallel programming with result of about 370k files processed in 30 seconds. Maybe this will help someone:

void DirSearch(string path)
    {
        ConcurrentQueue<string> pendingQueue = new ConcurrentQueue<string>();
        pendingQueue.Enqueue(path);

        ConcurrentBag<string> filesNames = new ConcurrentBag<string>();
        while(pendingQueue.Count > 0)
        {
            try
            {
                pendingQueue.TryDequeue(out path);

                var files = Directory.GetFiles(path);

                Parallel.ForEach(files, x => filesNames.Add(x));

                var directories = Directory.GetDirectories(path);

                Parallel.ForEach(directories, (x) => pendingQueue.Enqueue(x));
            }
            catch (Exception)
            {
                continue;
            }
        }
    }
like image 28
Hazaaa Avatar answered Oct 07 '22 04:10

Hazaaa