 

How to iterate over a folder with a large number of files in PowerShell?

Tags:

powershell

I'm trying to write a script that would go through 1.6 million files in a folder and move them to the correct folder based on the file name.

The reason is that NTFS can't handle a large number of files within a single folder without performance degrading.

The script calls "Get-ChildItem" to get all the items within that folder, and as you might expect, this consumes a lot of memory (about 3.8 GB).

I'm curious if there are any other ways to iterate through all the files in a directory without using up so much memory.

asked Sep 05 '12 by T.Ho

2 Answers

If you do

$files = Get-ChildItem $dirWithMillionsOfFiles
#Now, process with $files

you WILL face memory issues.

Use PowerShell piping to process the files:

Get-ChildItem $dirWithMillionsOfFiles | %{ 
    #process here
}

The second way consumes far less memory because each file is handed to the script block as it streams through the pipeline, so memory use should not grow with the number of files.
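For example, if the destination folder can be derived from the start of each file name (the exact naming rule isn't given in the question, so $destRoot and the two-character prefix below are placeholder assumptions), the processing block might look something like this:

Get-ChildItem $dirWithMillionsOfFiles | %{
    # Hypothetical rule: the first two characters of the name pick the subfolder
    $subDir = Join-Path $destRoot $_.Name.Substring(0, 2)
    if (!(Test-Path $subDir)) { New-Item -ItemType Directory -Path $subDir | Out-Null }
    Move-Item -LiteralPath $_.FullName -Destination $subDir
}

Each file is moved as it comes down the pipeline, so nothing is accumulated in a variable.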

answered Sep 21 '22 by manojlds


If you need to reduce the memory footprint, you can skip using Get-ChildItem and instead use a .NET API directly. I'm assuming you are on PowerShell v2; if so, first follow the steps here to enable .NET 4 to load in PowerShell v2.

In .NET 4 there are some nice APIs for enumerating files and directories, as opposed to returning them in arrays.

[IO.Directory]::EnumerateFiles("C:\logs") | %{ <# move file $_ here #> }

By using this API instead of [IO.Directory]::GetFiles(), only one file name is held in memory at a time, so the memory consumption stays relatively small.
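Putting it together, here is a minimal sketch of the whole move using the streaming API (again assuming a placeholder $destRoot and a two-character prefix rule, since the real naming scheme isn't specified):

[IO.Directory]::EnumerateFiles("C:\logs") | %{
    $name   = [IO.Path]::GetFileName($_)      # $_ is a full path string here, not a FileInfo
    $subDir = Join-Path $destRoot $name.Substring(0, 2)
    if (!(Test-Path $subDir)) { New-Item -ItemType Directory -Path $subDir | Out-Null }
    Move-Item -LiteralPath $_ -Destination $subDir
}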

Edit

I was also assuming you had tried a simple pipelined approach like Get-ChildItem |ForEach { process }. If this is enough, I agree it's the way to go.

But I want to clear up a common misconception: In v2, Get-ChildItem (or really, the FileSystem provider) does not truly stream. The implementation uses the APIs Directory.GetDirectories and Directory.GetFiles, which in your case will generate a 1.6M-element array before any processing can occur. Once this is done, then yes, the remainder of the pipeline is streaming. And yes, this initial low-level piece has relatively minimal impact, since it is simply a string array, not an array of rich FileInfo objects. But it is incorrect to claim that O(1) memory is used in this pattern.

PowerShell v3, in contrast, is built on .NET 4, and thus takes advantage of the streaming APIs I mention above (Directory.EnumerateDirectories and Directory.EnumerateFiles). This is a nice change, and helps in scenarios just like yours.

answered Sep 20 '22 by latkin