I'm trying to build a high-performance PowerShell script that will scan all files and folders on an NTFS/SAN share. I don't need the file data, just metadata such as attributes, timestamps, and permissions.
I can scan files & folders via a simple PowerShell script:
$path = "\\SomeNetworkServer\Root"
Get-ChildItem -Path $path -Recurse | % {
    #Do some work like (Get-ACL -Path ($_.FullName))...
}
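For context, the per-item work hinted at in that comment might look roughly like this (just a sketch; the properties shown are examples of the attribute, timestamp, and ACL metadata I'm after):
# Example of the per-item metadata work (properties chosen for illustration)
Get-ChildItem -Path $path -Recurse | % {
    $acl = Get-Acl -Path $_.FullName
    [PSCustomObject]@{
        FullName      = $_.FullName
        Attributes    = $_.Attributes
        CreationTime  = $_.CreationTimeUtc
        LastWriteTime = $_.LastWriteTimeUtc
        Owner         = $acl.Owner
        Sddl          = $acl.Sddl
    }
}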
The problem with this approach is that it can take hours or even days on very large storage systems with millions of folders and files. I figured parallelism might help better utilize storage, CPU, and network IO to process faster. I'm not sure where (if at all) a bottleneck exists in Get-ChildItem; I assume it just runs sequentially through each directory.
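For reference, a quick way to see how long the sequential scan takes is to time it on a representative subtree (the subtree path below is just a placeholder):
# Rough baseline: time the sequential enumeration on a subtree
$elapsed = Measure-Command {
    Get-ChildItem -Path "\\SomeNetworkServer\Root\SomeSubtree" -Recurse -Force -ErrorAction SilentlyContinue | Out-Null
}
"Sequential enumeration took $($elapsed.TotalSeconds) seconds"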
My approach was to build a function that also uses recursion, with the System.IO namespace to enumerate all directories and files, and RSJobs (the PoshRSJob module) to create a pool of actively running jobs per directory.
Something like:
function ExploreShare {
    param(
        [parameter(Mandatory = $true)]
        [string] $path,

        [parameter(Mandatory = $true)]
        [int] $threadCount)

    Begin {}
    Process {
        [System.IO.Directory]::EnumerateDirectories($path) | % {
            $curDirectory = $_;
            #Do work on current directory, output to pipeline....

            #Determine if some of the jobs finished
            $jobs = Get-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Receive-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Remove-RSJob
            $running = $jobs | ? { $_.State -eq "Running" };

            #If we exceed our threadCount quota run as normal recursion
            if ($running.Count -gt $threadCount -or ($threadCount -eq 0)) {
                ExploreShare -path $curDirectory -threadCount $threadCount
            }
            else {
                #Create a new job for the directory and its nested contents
                Start-RSJob -InputObject $curDirectory -ScriptBlock {
                    ExploreShare -path $_ -threadCount $using:threadCount
                } -FunctionsToImport @("ExploreShare") | Out-Null
            }
        }

        #Process all files in the current directory in the current thread/job
        [System.IO.Directory]::EnumerateFiles($path) | % {
            $curFile = $_;
            #Do work with current file, output to pipeline....
        }
    }
    End {
        #End of invocation: wait for jobs to finish and flush them out?
        $jobs = Get-RSJob
        $jobs | Wait-RSJob | Receive-RSJob
        $jobs | Remove-RSJob
    }
}
Then call it like so:
ExploreShare -path $path -threadCount 5
A few design challenges I'm struggling to overcome:
When calling Start-RSJob, I believe the nested job has no awareness of the parent's jobs, resulting in an ever-growing list of RSJobs per directory as ExploreShare is called recursively.
I tried capturing $jobs = Get-RSJob and passing $jobs into Start-RSJob and ExploreShare to give a child job awareness of the running jobs (not sure it actually works that way); it just caused my process memory to skyrocket.
If I keep the parallelism to just the enumerated root directories, I can end up in a situation where most of the root directories are empty except a few that contain all the nested files and folders (like a Department folder).
My goal was to make it so there are always at most X jobs running in parallel while ExploreShare is called, and if a nested directory is being enumerated when a previous job has completed, then that directory (and its files) gets processed as a new job instead of on the current thread.
Hopefully this makes sense 🤷‍♂️
I would recommend using the ThreadJob module if you're doing multi-threading in Windows PowerShell 5.1:
Install-Module ThreadJob -Scope CurrentUser
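As a quick illustration of how the throttling works (a minimal sketch, unrelated to your share): Start-ThreadJob only runs -ThrottleLimit jobs at a time and queues the rest.
# Minimal sketch: only 6 of these 20 jobs run at any one time, the rest wait in the queue
$jobs = foreach ($i in 1..20) {
    Start-ThreadJob -ThrottleLimit 6 -ScriptBlock {
        Start-Sleep -Seconds 1
        "job $using:i done"
    }
}
$jobs | Receive-Job -Wait -AutoRemoveJob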
Using a ConcurrentQueue<T>, this is how your code would look. Note, however, that I'm certain processing this in parallel will be slower than linear enumeration; for linear enumeration, see the next example.
$threadLimit = 6 # How many threads should run concurrently?
$queue = [System.Collections.Concurrent.ConcurrentQueue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)
$target = $null

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.TryDequeue([ref] $target)) {
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, go to the next one
        Write-Warning $_
        continue
    }

    $jobs = foreach ($item in $enum) {
        Start-ThreadJob {
            $queue = $using:queue
            $item = $using:item

            # do stuff with `$item` here (Get-Acl or something...)
            # it can be a file or a directory at this point
            $item

            # if `$item` is a directory, enqueue it
            if ($item -is [System.IO.DirectoryInfo]) {
                $queue.Enqueue($item)
            }
        } -ThrottleLimit $threadLimit
    }

    # wait for this batch of jobs before processing the next directory
    $jobs | Receive-Job -Wait -AutoRemoveJob
}
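If you need the output, assign the whole loop to a variable as noted in the comment and post-process it afterwards, for example (a sketch; the CSV path is just an example):
# Capture everything the jobs emit, then do the metadata work downstream
$result = while ($queue.TryDequeue([ref] $target)) {
    # ... same loop body as above ...
}
$result | ForEach-Object {
    [PSCustomObject]@{
        FullName = $_.FullName
        Owner    = (Get-Acl -LiteralPath $_.FullName).Owner
    }
} | Export-Csv -Path .\scan.csv -NoTypeInformation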
For linear processing, which in my opinion will certainly be faster than the example above, you can use a normal Queue<T>, as there is no need to handle thread safety. In addition, this method is also faster than using Get-ChildItem -Recurse.
$queue = [System.Collections.Generic.Queue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.Count) {
    $target = $queue.Dequeue()
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, go to the next one
        Write-Warning $_
        continue
    }

    foreach ($item in $enum) {
        # do stuff with `$item` here (Get-Acl or something...)
        # it can be a file or a directory at this point
        $item

        # if `$item` is a directory, enqueue it
        if ($item -is [System.IO.DirectoryInfo]) {
            $queue.Enqueue($item)
        }
    }
}
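You could also wrap the linear version in a function so the traversal streams items to the pipeline and the metadata work stays with the caller (again a sketch; the function name is just an example):
# Sketch: the same traversal as a reusable, streaming function
function Get-ShareItem {
    param([Parameter(Mandatory)] [string] $Path)

    $queue = [System.Collections.Generic.Queue[System.IO.DirectoryInfo]]::new()
    $queue.Enqueue((Get-Item -LiteralPath $Path))

    while ($queue.Count) {
        $target = $queue.Dequeue()
        try {
            $enum = $target.EnumerateFileSystemInfos()
        }
        catch {
            Write-Warning $_
            continue
        }

        foreach ($item in $enum) {
            $item # emit to the pipeline; the caller decides what to do with it
            if ($item -is [System.IO.DirectoryInfo]) {
                $queue.Enqueue($item)
            }
        }
    }
}

# Usage: stream items and fetch ACLs downstream
Get-ShareItem -Path '\\SomeNetworkServer\Root' | ForEach-Object {
    Get-Acl -LiteralPath $_.FullName
}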