
Efficiently counting files in directory and subfolders with specific name

I can count all the files in a folder and its sub-folders; the folders themselves are not counted:

(gci -Path *Fill_in_path_here* -Recurse -File | where Name -like "*STB*").Count

However, PowerShell is too slow for the number of files involved (up to 700k). I read that cmd is faster at this kind of task.

Unfortunately I have no knowledge of cmd code at all. In the example above I am counting all the files with STB in the file name.

That is what I would like to do in cmd as well.

Any help is appreciated.

asked Dec 10 '22 by InPanic

2 Answers

Theo's helpful answer based on direct use of .NET ([System.IO.Directory]::EnumerateFiles()) is the fastest option (in my tests; YMMV - see the benchmark code below[1]).

Its limitations in the .NET Framework (FullCLR) - on which Windows PowerShell is built - are:

  • An exception is thrown when an inaccessible directory is encountered (due to lack of permissions). You can catch the exception, but you cannot continue the enumeration; that is, you cannot robustly enumerate all items that you can access while ignoring those that you cannot.

  • Hidden items are invariably included.

  • With recursive enumeration, symlinks / junctions to directories are invariably followed.

By contrast, the cross-platform .NET Core framework, since v2.1 - on which PowerShell Core is built - offers ways around these limitations via the EnumerationOptions type - see this answer for an example.
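To make the robustness point concrete, here is a cross-platform Python analogue (an illustration of the idea, not the .NET API): os.walk's onerror callback lets the walk continue past directories it cannot read, which is the behavior .NET Core opts into via EnumerationOptions.IgnoreInaccessible.

```python
import fnmatch
import os

def count_matching_files(root, pattern="*STB*"):
    """Recursively count files whose name matches `pattern`, skipping
    directories that cannot be read: permission errors are swallowed
    and the enumeration continues, instead of aborting with an
    exception as .NET Framework's EnumerateFiles() does."""
    total = 0
    # onerror receives the OSError; returning without re-raising
    # makes os.walk skip that directory and keep going.
    for _dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        total += sum(1 for name in filenames if fnmatch.fnmatch(name, pattern))
    return total
```

Like EnumerateFiles() with a name pattern, the matching here applies only to the file name, not the full path.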

Note that you can also perform enumeration via the related [System.IO.DirectoryInfo] type, which - similar to Get-ChildItem - returns rich objects rather than mere path strings, allowing for much more versatile processing; e.g., to get an array of all file sizes (property .Length, implicitly applied to each file object):

([System.IO.DirectoryInfo] $somePath).EnumerateFiles('*STB*', 'AllDirectories').Length

A native PowerShell solution that addresses these limitations and is still reasonably fast is to use Get-ChildItem with the -Filter parameter.

(Get-ChildItem -LiteralPath $somePath -Filter *STB* -Recurse -File).Count

  • Hidden items are excluded by default; add -Force to include them.

  • To ignore permission problems, add -ErrorAction SilentlyContinue or -ErrorAction Ignore; the advantage of SilentlyContinue is that you can later inspect the $Error collection to determine the specific errors that occurred, so as to ensure that the errors truly only stem from permission problems.

    • Note that PowerShell Core - unlike Windows PowerShell - helpfully ignores the inability to enumerate the contents of the hidden system junctions that exist for pre-Vista compatibility only, such as $env:USERPROFILE\Cookies.
  • In Windows PowerShell, Get-ChildItem -Recurse invariably follows symlinks / junctions to directories, unfortunately; more sensibly, PowerShell Core by default does not, and offers opt-in via -FollowSymlink.

  • Like the [System.IO.DirectoryInfo]-based solution, Get-ChildItem outputs rich objects ([System.IO.FileInfo] / [System.IO.DirectoryInfo]) describing each enumerated file-system item, allowing for versatile processing.

Note that while you can also pass wildcard arguments to -Path (the implied first positional parameter) and -Include (as in TobyU's answer), it is only -Filter that provides significant speed improvements, due to filtering at the source (the filesystem driver), so that PowerShell only receives the already-filtered results; by contrast, -Path / -Include must first enumerate everything and match against the wildcard pattern afterwards.[2]
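The difference between filtering at the source and filtering after enumeration can be sketched in cross-platform Python (purely to illustrate the concept; the PowerShell speed gap comes from *where* the filtering happens, not from the language):

```python
import fnmatch
from pathlib import Path

def count_filtered_at_source(root, pattern="*STB*"):
    # The pattern is applied during enumeration (rglob), so
    # non-matching entries are never yielded - analogous to -Filter,
    # where the filesystem provider returns only matching names.
    return sum(1 for p in Path(root).rglob(pattern) if p.is_file())

def count_filtered_afterwards(root, pattern="*STB*"):
    # Everything is enumerated first, then matched - analogous to
    # -Path / -Include, or piping to Where-Object as in the question.
    return sum(1 for p in Path(root).rglob("*")
               if p.is_file() and fnmatch.fnmatch(p.name, pattern))
```

Both return the same count; the first simply never materializes the non-matching entries.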

Caveats re -Filter use:

  • Its wildcard language is not the same as PowerShell's; notably, it doesn't support character sets/ranges (e.g. *[0-9]) and it has legacy quirks - see this answer.
  • It only supports a single wildcard pattern, whereas -Include supports multiple (as an array).

That said, -Filter processes wildcards the same way as cmd.exe's dir.
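To make the character-set/range limitation concrete: a pattern such as *[0-9] matches names ending in a digit in PowerShell's own wildcard language, but -Filter's legacy dialect has no equivalent. Python's fnmatch, which shares the set/range syntax, is used below purely as a stand-in to show what such patterns match:

```python
from fnmatch import fnmatch

# Character sets/ranges, as supported by PowerShell's own wildcard
# language (and fnmatch), but NOT by -Filter's legacy dialect:
print(fnmatch("report7", "*[0-9]"))             # True  - ends in a digit
print(fnmatch("report", "*[0-9]"))              # False - no trailing digit
print(fnmatch("file_STB_9.log", "*STB*[0-9].log"))  # True
```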


Finally, for the sake of completeness, you can adapt MC ND's helpful answer based on cmd.exe's dir command for use in PowerShell, which simplifies matters:

(cmd /c dir /s /b /a-d "$somePath/*STB*").Count

PowerShell captures an external program's stdout output as an array of lines, whose element count you can simply query with the .Count (or .Length) property.
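The same capture-and-count pattern looks like this in cross-platform Python (the child command is an arbitrary stand-in for dir /s /b, chosen so the example runs anywhere):

```python
import subprocess
import sys

# Run a child process and capture its stdout; splitting on line
# boundaries gives a list whose length is the line count - the
# equivalent of PowerShell's (...).Count on captured output.
result = subprocess.run(
    [sys.executable, "-c", "print('line one'); print('line two')"],
    capture_output=True, text=True, check=True,
)
lines = result.stdout.splitlines()
print(len(lines))  # 2
```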

That said, this may or may not be faster than PowerShell's own Get-ChildItem -Filter, depending on the filtering scenario; also note that dir /s can only ever return path strings, whereas Get-ChildItem returns rich objects whose properties you can query.

Caveats re dir use:

  • /a-d excludes directories, i.e., only reports files, but then also includes hidden files, which dir doesn't do by default.

  • dir /s invariably descends into hidden directories too during the recursive enumeration; an /a (attribute-based) filter is only applied to the leaf items of the enumeration (only to files in this case).

  • dir /s invariably follows symlinks / junctions to other directories (assuming it has the requisite permissions - see next point).

  • dir /s quietly ignores directories or symlinks / junctions to directories if it cannot enumerate their contents due to lack of permissions - while this is helpful in the specific case of the aforementioned hidden system junctions (you can find them all with cmd /c dir C:\ /s /ashl), it can cause you to miss the content of directories that you do want to enumerate, but can't for true lack of permissions, because dir /s will give no indication that such content may even exist (if you directly target an inaccessible directory, you get a somewhat misleading File Not Found error message, and the exit code is set to 1).


Performance comparison:

  • The following tests compare pure enumeration performance without filtering, for simplicity, using a sizable directory tree assumed to be present on all systems, c:\windows\winsxs; that said, it's easy to adapt the tests to also compare filtering performance.

  • The tests are run from PowerShell, which means that some overhead is introduced by creating a child process for cmd.exe in order to invoke dir /s, though (a) that overhead should be relatively low and (b) the larger point is that staying in the realm of PowerShell is well worthwhile, given its vastly superior capabilities compared to cmd.exe.

  • The tests use function Time-Command, which can be downloaded from this Gist, which averages 10 runs by default.

# Warm up the filesystem cache for the target dir.,
# both from PowerShell and cmd.exe, to be safe.
gci 'c:\windows\winsxs' -rec >$null; cmd /c dir /s 'c:\windows\winsxs' >$null

Time-Command `
  { @([System.IO.Directory]::EnumerateFiles('c:\windows\winsxs', '*', 'AllDirectories')).Count },
  { (Get-ChildItem -Force -Recurse -File 'c:\windows\winsxs').Count },
  { (cmd /c dir /s /b /a-d 'c:\windows\winsxs').Count },
  { cmd /c 'dir /s /b /a-d c:\windows\winsxs | find /c /v """"' }
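For readers without the Gist handy, the average-of-N-runs idea behind Time-Command can be sketched generically (a hypothetical helper, written in Python for portability; it measures wall-clock time only and does nothing about warm-up or outliers):

```python
import time

def time_command(func, runs=10):
    """Average the wall-clock time of `runs` invocations of `func`,
    the way Time-Command averages 10 runs by default."""
    start = time.perf_counter()
    for _ in range(runs):
        func()
    return (time.perf_counter() - start) / runs

# Example: time a small piece of work.
avg = time_command(lambda: [i * i for i in range(1000)])
print(f"{avg:.6f} seconds per run")
```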

On my single-core VMWare Fusion VM with Windows PowerShell v5.1.17134.407 on Microsoft Windows 10 Pro (64-bit; Version 1803, OS Build: 17134.523) I get the following timings, from fastest to slowest (the Factor column shows relative performance):

Command                                                                                    Secs (10-run avg.) TimeSpan         Factor
-------                                                                                    ------------------ --------         ------
@([System.IO.Directory]::EnumerateFiles('c:\windows\winsxs', '*', 'AllDirectories')).Count 11.016             00:00:11.0158660 1.00
(cmd /c dir /s /b /a-d 'c:\windows\winsxs').Count                                          15.128             00:00:15.1277635 1.37
cmd /c 'dir /s /b /a-d c:\windows\winsxs | find /c /v """"'                                16.334             00:00:16.3343607 1.48
(Get-ChildItem -Force -Recurse -File 'c:\windows\winsxs').Count                            24.525             00:00:24.5254979 2.23

Interestingly, both [System.IO.Directory]::EnumerateFiles() and the Get-ChildItem solution are significantly faster in PowerShell Core, which runs on top of .NET Core (as of PowerShell Core 6.2.0-preview.4, .NET Core 2.1):

Command                                                                                    Secs (10-run avg.) TimeSpan         Factor
-------                                                                                    ------------------ --------         ------
@([System.IO.Directory]::EnumerateFiles('c:\windows\winsxs', '*', 'AllDirectories')).Count 5.094              00:00:05.0940364 1.00
(cmd /c dir /s /b /a-d 'c:\windows\winsxs').Count                                          12.961             00:00:12.9613440 2.54
cmd /c 'dir /s /b /a-d c:\windows\winsxs | find /c /v """"'                                14.999             00:00:14.9992965 2.94
(Get-ChildItem -Force -Recurse -File 'c:\windows\winsxs').Count                            16.736             00:00:16.7357536 3.29

[1] [System.IO.Directory]::EnumerateFiles() is inherently and undoubtedly faster than a Get-ChildItem solution. In my tests (see section "Performance comparison:" above), [System.IO.Directory]::EnumerateFiles() beat out cmd /c dir /s as well, slightly in Windows PowerShell, and clearly so in PowerShell Core, but others report different findings. That said, finding the overall fastest solution is not the only consideration, especially if more than just counting files is needed and if the enumeration needs to be robust. This answer discusses the tradeoffs of the various solutions.

[2] In fact, due to an inefficient implementation as of Windows PowerShell v5.1 / PowerShell Core 6.2.0-preview.4, use of -Path and -Include is actually slower than using Get-ChildItem unfiltered and instead using an additional pipeline segment with ... | Where-Object Name -like *STB*, as in the OP - see this GitHub issue.

answered May 12 '23 by mklement0

One of the fastest ways to do it from the cmd command line or a batch file could be:

dir "x:\some\where\*stb*" /s /b /a-d | find /c /v ""

Just a recursive (/s) dir command that lists all files (no folders, /a-d) in bare format (/b), with all the output piped to the find command, which counts (/c) the number of non-empty lines (/v "").

But in any case you will need to enumerate the files, and that takes time.

edited to adapt to comments, BUT

note: The approach below does not work for this case because, at least in Windows 10, the space padding in the summary lines of the dir command is fixed at five positions. File counts greater than 99999 are not correctly padded, so the sort /r output is not correct.

As pointed out by Ben Personick, the dir command also outputs the number of files, and we can retrieve this information:

@echo off
    setlocal enableextensions disabledelayedexpansion

    rem Configure where and what to search
    set "search=x:\some\where\*stb*"

    rem Retrieve the number of files
    set "numFiles=0"
    for /f %%a in ('
        dir "%search%" /s /a-d /w 2^>nul        %= get the list of the files        =%
        ^| findstr /r /c:"^  *[1-9]"            %= retrieve only summary lines      =%
        ^| sort /r 2^>nul                       %= reverse sort, greater line first =%
        ^| cmd /e /v /c"set /p .=&&echo(!.!"    %= retrieve only first line         =%
    ') do set "numFiles=%%a"

    echo File(s) found: %numFiles% 

The basic idea is to use a series of piped commands, each handling a different part of the data retrieval:

  • Use a dir command to generate the list of files (/w is included just to generate less lines).
  • As we only want the summary lines with the number of files found, findstr is used to retrieve only those lines that start with spaces (the header/summary lines) followed by a number greater than 0 (the file-count summary lines; as we are using /a-d, the directory-count summary lines will have a value of 0).
  • Sort the lines in reverse order so that the greatest line comes first (summary lines start with a left-space-padded number). The greatest line (the final file count or equivalent) will be the first line.
  • Retrieve only this line using a set /p command in a separate cmd instance. As the full sequence is wrapped in a for /f, which has performance problems when retrieving long lists from command execution, we try to retrieve as little as possible.

The for /f tokenizes the retrieved line, gets the first token (the number of files) and sets the variable used to hold the data (the variable is initialized beforehand, since it is possible that no files are found).
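The findstr + sort /r + first-line logic can be sketched outside of batch as well (a hypothetical Python illustration operating on captured dir /s output; the "File(s)" label it matches is locale-specific, so this is a sketch, not a robust parser):

```python
import re

def file_count_from_dir_output(dir_output):
    """Mimic the batch pipeline: keep summary lines that start with
    spaces followed by a nonzero count (findstr /r /c:"^  *[1-9]"),
    then pick the greatest count, which is what sort /r followed by
    taking the first line approximates."""
    counts = []
    for line in dir_output.splitlines():
        # Locale-specific: assumes the English "n File(s)" summary format.
        match = re.match(r"^\s+([1-9]\d*)\s+File\(s\)", line)
        if match:
            counts.append(int(match.group(1)))
    # The greatest "File(s)" summary is the grand total.
    return max(counts, default=0)

sample = """\
 Directory of x:\\some\\where

               2 File(s)          1,234 bytes
 Directory of x:\\some\\where\\sub

               1 File(s)            567 bytes

     Total Files Listed:
               3 File(s)          1,801 bytes
"""
print(file_count_from_dir_output(sample))  # 3
```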

answered May 12 '23 by MC ND