 

Recursive parallelism with PowerShell jobs to explore file shares?

I'm trying to build a high-performance PowerShell script that will scan all files & folders on an NTFS/SAN share. I don't need file data, just metadata such as attributes, timestamps, and permissions.

I can scan files & folders via a simple PowerShell script:

$path = "\\SomeNetworkServer\Root"
Get-ChildItem -Path $path -Recurse | % {

   #Do some work like (Get-ACL -Path ($_.FullName))...

}

The problem with this approach is that it can take hours or even days on very large storage systems with millions of folders & files. I figured parallelism might better utilize storage, CPU, and network I/O to process faster. I'm not sure where (if at all) the bottleneck is in Get-ChildItem; I assume it just walks each directory sequentially.
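
For reference, here's a minimal timing sketch (using the same share path as above) I'd use to sanity-check that assumption. Note that, unlike Get-ChildItem, the raw .NET enumerator stops at the first inaccessible directory when used with AllDirectories:

$path = "\\SomeNetworkServer\Root"

# Time the cmdlet-based recursive scan
(Measure-Command {
    Get-ChildItem -Path $path -Recurse -ErrorAction SilentlyContinue | Out-Null
}).TotalSeconds

# Time raw .NET enumeration of the same tree
(Measure-Command {
    [System.IO.Directory]::EnumerateFileSystemEntries(
        $path, '*', [System.IO.SearchOption]::AllDirectories) | Out-Null
}).TotalSeconds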

My approach was to build a function that also uses recursion, using the System.IO namespace to enumerate all directories and files, and the PoshRSJob module (RSJobs) to maintain a pool of actively running jobs per directory.

Something like:

function ExploreShare {
    param(
        [parameter(Mandatory = $true)]
        [string] $path, 
        
        [parameter(Mandatory = $true)]
        [int] $threadCount)
    Begin {}
    Process {
        [System.IO.Directory]::EnumerateDirectories($path) | % {
            $curDirectory = $_;
            
            #Do work on current directory, output to pipeline...
            

            #Determine if some of the jobs finished
            $jobs = Get-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Receive-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Remove-RSJob
            $running = $jobs | ? { $_.State -eq "Running" };
            
            #If we exceed our threadCount quota run as normal recursion
            if ($running.count -gt $threadCount -or ($threadCount -eq 0)) {
                ExploreShare -path $curDirectory -threadCount $threadCount
            }
            else {
                #Create a new Job for the directory and its nested contents
                Start-RSJob -InputObject $curDirectory -ScriptBlock {
                    ExploreShare -path $_ -threadCount $using:threadCount
                } -FunctionsToLoad @("ExploreShare") | Out-Null
            }
        }

        #Process all files in current directory in the current thread/job
        [System.IO.Directory]::EnumerateFiles($path) | % {
            $curFile = $_;

            #Do work with current file, output to pipeline....
        }
    }
    End {
        #End of invocation: wait for jobs to finish and flush them out?
        $jobs = Get-RSJob
        $jobs | Wait-RSJob | Receive-RSJob
        $jobs | Remove-RSJob
    }
}

Then call it like so: ExploreShare -path $path -threadCount 5

A few design challenges I'm struggling to overcome:

  1. When calling Start-RSJob, I believe the nested job has no awareness of the parent's jobs, resulting in an ever-growing list of RSJobs per directory as ExploreShare is recursively called.

    • I tried doing something like $jobs = Get-RSJob and passing $jobs into Start-RSJob & ExploreShare to make a child job aware of running jobs (not sure it actually works that way); it just caused my process memory to skyrocket.
  2. If I keep the parallelism to just the enumerated root directories, I can end up in a situation where most of the root directories are empty except for a few that contain all the nested files & folders (like a Department folder).


My goal is for there to always be only X jobs running in parallel while ExploreShare executes, so that if a previous job completes while a nested directory is being enumerated, that directory (and its files) is processed as a new job instead of on the current thread.

Hopefully this makes sense 🤷‍♂️

asked Dec 31 '25 by The Unique Paul Smith

1 Answer

I would recommend using the ThreadJob module if you're doing multi-threading in Windows PowerShell 5.1:

Install-Module ThreadJob -Scope CurrentUser
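
For reference, a minimal sketch of the pattern used below (Start-ThreadJob mirrors Start-Job, but runs the script block on an in-process thread, so jobs are much cheaper to create than process-based Start-Job jobs):

# Start a job on a background thread, then wait for and collect its output
$job = Start-ThreadJob -ScriptBlock {
    "running on managed thread $([System.Threading.Thread]::CurrentThread.ManagedThreadId)"
}
$job | Receive-Job -Wait -AutoRemoveJob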

Using a ConcurrentQueue<T>, this is how your code would look. But note: I'm certain that processing this in parallel will be slower than linear enumeration, since spawning a thread job per entry costs more than the enumeration work itself. For linear enumeration, see the next example.

$threadLimit = 6 # how many threads should run concurrently?
$queue = [System.Collections.Concurrent.ConcurrentQueue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)

$target = $null

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.TryDequeue([ref] $target)) {
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, skip to the next
        Write-Warning $_
        continue
    }

    $jobs = foreach ($item in $enum) {
        Start-ThreadJob {
            $queue = $using:queue
            $item = $using:item

            # do stuff with `$item` here (Get-Acl or something...)
            # it can be a file or directory at this point
            $item

            # if `$item` is a directory
            if ($item -is [System.IO.DirectoryInfo]) {
                # enqueue it
                $queue.Enqueue($item)
            }
        } -ThrottleLimit $threadLimit
    }

    # wait for them before processing the next set of stuff
    $jobs | Receive-Job -Wait -AutoRemoveJob
}

For linear processing, which in my opinion will certainly be faster than the example above, you can use a normal Queue<T>, since there is no need for thread safety. This method is also faster than Get-ChildItem -Recurse.

$queue = [System.Collections.Generic.Queue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.Count) {
    $target = $queue.Dequeue()
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, skip to the next
        Write-Warning $_
        continue
    }

    foreach ($item in $enum) {
        # do stuff with `$item` here (Get-Acl or something...)
        # it can be a file or directory at this point
        $item

        # if `$item` is a directory
        if ($item -is [System.IO.DirectoryInfo]) {
            # enqueue it
            $queue.Enqueue($item)
        }
    }
}
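
To tie this back to the metadata goal (attributes, timestamps, permissions), the "do stuff" placeholder in either loop could emit one object per entry. A minimal sketch; the object shape here is just an illustration, and it assumes a Get-Acl failure should be logged rather than stop the scan:

# Hypothetical replacement for the "do stuff" placeholder above
$acl = $null
try { $acl = Get-Acl -LiteralPath $item.FullName } catch { Write-Warning $_ }

# One record per file-system entry with the metadata of interest
[pscustomobject]@{
    FullName      = $item.FullName
    Attributes    = $item.Attributes
    CreationTime  = $item.CreationTimeUtc
    LastWriteTime = $item.LastWriteTimeUtc
    Owner         = $acl.Owner
    Access        = $acl.Access
}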
answered Jan 02 '26 by Santiago Squarzon

