 

PowerShell: Importing 16MB CSV Into PowerShell Variable Creates >600MB of PowerShell Memory Usage

I'm trying to understand why PowerShell's memory balloons so much when I import a ~16MB file into a variable. I can understand there's additional memory structure around that variable, but I'm just trying to understand why it's THAT high. Here's what I'm doing below - just a stripped-down, simplified snippet of another script that anyone can run.

Notes/Questions

  1. Not complaining - just trying to understand why the usage is so high, and whether there's a better way of doing this or of managing memory more efficiently to respect the system I'm running this on.
  2. This same behavior occurs in PowerShell 5.1 and in PowerShell 7 (RC3 just released). I don't think this is a bug, simply another opportunity for me to learn more.
  3. My overall objective with this is to run a foreach loop to check another much smaller array against this array for matches or lack thereof (sketched after the test code below).

My Test Code

Invoke-WebRequest -Uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip

# Extract to C:\ so the CSV ends up at C:\top-1m.csv
Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\

$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
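
For context, the matching step I have in mind looks roughly like this ($mySmallerList and the .Domain property name are just placeholders, not part of the actual script):

# Rough sketch of the matching step described in note 3 above.
# $mySmallerList and the .Domain property name are placeholders.
foreach ($entry in $mySmallerList) {
    if ($alexaTopMillion.Domain -contains $entry) {
        "match:    $entry"
    }
    else {
        "no match: $entry"
    }
}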

To anyone answering this: Thank you for your time and helping me learn more every day!

asked Feb 22 '20 by Dave Spatz



2 Answers

Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):

To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.

That is, instead of doing this:

# !! Collects ALL objects in memory, as an array.
$rows = Import-Csv in.csv
foreach ($row in $rows) { ... }

do this:

# Process objects ONE BY ONE.
# As long as you stream to a *file* or some other output stream
# (as opposed to assigning to a *variable*), memory use should remain constant,
# except for temporarily held memory awaiting garbage collection.
Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance

However, even then you seemingly can run out of memory with very large files - see this question - possibly related to buildup of memory from no longer needed objects that haven't yet been garbage-collected; therefore, periodically calling [GC]::Collect() in the ForEach-Object script block may solve the problem - see this answer for an example.
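
A rough sketch of that periodic-collection approach (the 100,000-row threshold is arbitrary - adjust it to your data):

# Sketch only: force a garbage collection every 100,000 rows.
$i = 0
Import-Csv in.csv | ForEach-Object {
  # ... process the row ...
  $i++
  if ($i % 100000 -eq 0) { [GC]::Collect() }
} # pipe to Export-Csv, for instance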


If you do need to collect all objects output by Import-Csv in memory at once:

The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):

The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.

The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):

  • Read the first CSV row and dynamically create a custom class that represents the rows, using Invoke-Expression.

    • Note: While its use is safe here, Invoke-Expression should generally be avoided.

    • If you know the column structure in advance, you can create a custom class the regular way (see the sketch after the code below), which also allows you to use appropriate data types for the properties (which are otherwise all strings by default); for instance, defining appropriate properties as [int] (System.Int32) further reduces memory consumption.

  • Pipe Import-Csv to a ForEach-Object call that converts each [pscustomobject] created to an instance of the dynamically created class, which stores the data more efficiently.

Note: This workaround comes at the expense of considerably diminished execution speed.

$csvFile = 'C:\top-1m.csv'

# Dynamically define a custom class derived from the *first* row
# read from the CSV file.
# Note: While this is a legitimate use of Invoke-Expression, 
#       it should generally be avoided.
"class CsvRow { 
 $((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";") 
}" | Invoke-Expression

# Import all rows and convert them from [pscustomobject] instances 
# to [CsvRow] instances to reduce memory consumption.
# Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
#       faster, but increases *temporary* memory pressure substantially.
$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
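
By contrast, if you know the columns ahead of time, you can write the class by hand with appropriately typed properties. The Rank and Domain column names below are assumptions for illustration only - the cast works only if they match the CSV's actual column names:

# Hand-written equivalent with assumed column names; typing Rank as [int]
# stores the value more compactly than a string.
class CsvRowTyped {
  [int]    $Rank
  [string] $Domain
}

$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRowTyped] $_ }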

Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say, via an -OutputType parameter, as proposed in GitHub issue #8862.
If that is of interest to you, show your support for the proposal there.


Memory-use benchmarks:

The following code compares the memory use of a normal Import-Csv import (an array of [pscustomobject] instances) with that of the workaround (an array of custom-class instances).

The measurement isn't exact, as the PowerShell process's working-set memory is simply queried, which can reflect background activity, but it gives a rough sense of how much less memory using a custom class requires.

Sample output, showing that the custom-class workaround requires only about one fifth of the memory with the sample 10-column, ~166,000-row CSV input file used below - the specific ratio depends on the number of input rows and columns:

MB Used Command
------- -------
 384.50  # normal import…
  80.48  # import via custom class…

Benchmark code:

# Create a sample CSV file with 10 columns about 16 MB in size.
$tempCsvFile = [IO.Path]::GetTempFileName()
('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
  Add-Content $tempCsvFile

try {

  { # normal import
    $all = Import-Csv $tempCsvFile
  },
  { # import via custom class
    "class CsvRow {
      $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
    }" | Invoke-Expression
    $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
  } | ForEach-Object {
    [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
    Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
    $before = (Get-Process -Id $PID).WorkingSet64
    # Execute the command.
    & $_
    # Measure memory consumption and output the result.
    [pscustomobject] @{
      'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
      Command = $_
    }
  }

} finally {
  Remove-Item $tempCsvFile
}
answered Nov 15 '22 by mklement0


You can generate a type for each item as described here: https://github.com/PowerShell/PowerShell/issues/7603

Import-Csv "C:\top-1m.csv" | Select-Object -first 1 | ForEach {$_.psobject.properties.name} | Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property {"`t`$$_"}  | Invoke-Expression
Import-Csv "C:\top-1m.csv" | Foreach {[MyCsv]$_} | Export-Csv "C:\alexa_top.csv"

This is considerably more efficient. You can measure the time using Measure-Command (see the example below).
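
For example, something along these lines (assuming the MyCsv class from the snippet above has already been defined):

# Compare wall-clock times of the plain import and the typed import.
(Measure-Command { Import-Csv "C:\top-1m.csv" }).TotalSeconds
(Measure-Command { Import-Csv "C:\top-1m.csv" | ForEach-Object { [MyCsv] $_ } }).TotalSeconds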

If you use Get-Content instead, it's very slow. The -Raw parameter improves the speed, but memory pressure gets high.

The -ReadCount parameter sets how many lines are sent through the pipeline at a time; this is even faster than using -Raw.
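
A rough sketch of the -ReadCount approach (the batch size of 1000 is arbitrary):

# Each pipeline object is an array of up to 1000 lines.
Get-Content "C:\top-1m.csv" -ReadCount 1000 | ForEach-Object {
  foreach ($line in $_) {
    # ... process $line ...
  }
}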

The file can even be read using a switch statement:

Switch -File "Path" {default {$_}}

That's even faster still, but unfortunately it uses even more memory.

answered Nov 15 '22 by programmer365