I'm trying to understand why PowerShell's memory balloons so much when I import a file that's ~16 MB as a variable. I can understand there's additional memory structure around that variable, but I'm just trying to understand why it's THAT high. Here's what I'm doing below - just a stripped-down, simplistic snippet of another script that anyone can run.
Notes/Questions
My Test Code
Invoke-WebRequest -uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip
Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\top-1m.csv
$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
To anyone answering this: Thank you for your time and helping me learn more every day!
The Import-Csv cmdlet converts the CSV data into custom objects in PowerShell, one per row. This way we can easily walk through each row of the CSV file and use the data in our scripts.
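For illustration, a minimal sketch of that per-row access (the Rank and Domain column names are assumptions supplied via -Header, since the Alexa file itself has no header row):
# Stream the first few rows and access their properties by name.
Import-Csv -Path C:\top-1m.csv -Header 'Rank','Domain' |
  Select-Object -First 3 |
  ForEach-Object { "{0}: {1}" -f $_.Rank, $_.Domain }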
Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):
To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.
That is, instead of doing this:
# !! Collects ALL objects in memory, as an array.
$rows = Import-Csv in.csv
foreach ($row in $rows) { ... }
do this:
# Process objects ONE BY ONE.
# As long as you stream to a *file* or some other output stream
# (as opposed to assigning to a *variable*), memory use should remain constant,
# except for temporarily held memory awaiting garbage collection.
Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance
However, even then you can seemingly run out of memory with very large files - see this question - possibly related to a buildup of memory from no-longer-needed objects that haven't yet been garbage-collected; therefore, periodically calling [GC]::Collect() in the ForEach-Object script block may solve the problem - see this answer for an example.
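As a minimal sketch of that approach (the counter variable and the 100,000-row interval are illustrative choices, not values from the linked answer):
$i = 0
Import-Csv in.csv | ForEach-Object {
  # ... per-row processing goes here ...
  $i++
  # Periodically force garbage collection so no-longer-needed row objects are released.
  if (($i % 100000) -eq 0) { [GC]::Collect() }
}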
As for keeping the entire Import-Csv output in memory at once: The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):
The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.
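To put that in perspective for this particular file (assuming the two columns, rank and domain, of the Alexa list): 2 properties × 48 bytes × 1,000,000 rows is roughly 96 MB of PSNoteProperty overhead alone - several times the size of the ~16 MB file - before counting the strings themselves and the per-row [pscustomobject] wrappers.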
The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):
Read the first CSV row and dynamically create a custom class that represents the rows, using Invoke-Expression.
Note: While its use is safe here, Invoke-Expression should generally be avoided.
If you know the column structure in advance, you can create a custom class the regular way, which also allows you to use appropriate data types for the properties (which are otherwise all strings by default); for instance, defining appropriate properties as [int] (System.Int32) further reduces memory consumption - see the sketch after the example code below.
Pipe Import-Csv to a ForEach-Object call that converts each [pscustomobject] created to an instance of the dynamically created class, which stores the data more efficiently.
Note: This workaround comes at the expense of considerably diminished execution speed.
$csvFile = 'C:\top-1m.csv'
# Dynamically define a custom class derived from the *first* row
# read from the CSV file.
# Note: While this is a legitimate use of Invoke-Expression,
# it should generally be avoided.
"class CsvRow {
$((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
}" | Invoke-Expression
# Import all rows and convert them from [pscustomobject] instances
# to [CsvRow] instances to reduce memory consumption.
# Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
# faster, but increases *temporary* memory pressure substantially.
$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
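For comparison, here is a minimal sketch of the "regular way" mentioned above, assuming the two columns of the Alexa list; the class name AlexaRow, the property names, and the use of -Header are illustrative assumptions, since the file itself has no header row:
# Hypothetical statically defined row class with typed properties.
class AlexaRow {
  [int]    $Rank    # [int] (System.Int32) instead of [string] further reduces memory use.
  [string] $Domain
}
# Supply column names via -Header and convert each row while streaming.
$alexaTopMillion = Import-Csv C:\top-1m.csv -Header 'Rank','Domain' |
  ForEach-Object { [AlexaRow] $_ }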
Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say via an -OutputType parameter, as proposed in GitHub issue #8862.
If that is of interest to you, show your support for the proposal there.
The following code compares memory use of a normal Import-Csv import (an array of [pscustomobject]s) with the workaround (an array of custom-class instances).
The measurement isn't exact, as the PowerShell process's working-set memory is simply queried, which can show the influence of background activities, but it gives a rough sense of how much less memory using a custom class requires.
Sample output, which shows that the custom-class workaround requires only about one fifth of the memory with the sample 10-column CSV input file of about 166,000 rows used below - the specific ratio depends on the number of input rows and columns:
MB Used Command
------- -------
384.50 # normal import…
80.48 # import via custom class…
Benchmark code:
# Create a sample CSV file with 10 columns about 16 MB in size.
$tempCsvFile = [IO.Path]::GetTempFileName()
('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
Add-Content $tempCsvFile
try {
  { # normal import
    $all = Import-Csv $tempCsvFile
  },
  { # import via custom class
    "class CsvRow {
      $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
    }" | Invoke-Expression
    $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
  } | ForEach-Object {
    [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
    Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
    $before = (Get-Process -Id $PID).WorkingSet64
    # Execute the command.
    & $_
    # Measure memory consumption and output the result.
    [pscustomobject] @{
      'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
      Command = $_
    }
  }
} finally {
  Remove-Item $tempCsvFile
}
You can generate a type for each item as described here: https://github.com/PowerShell/PowerShell/issues/7603
Import-Csv "C:\top-1m.csv" | Select-Object -first 1 | ForEach {$_.psobject.properties.name} | Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property {"`t`$$_"} | Invoke-Expression
Import-Csv "C:\top-1m.csv" | Foreach {[MyCsv]$_} | Export-Csv "C:\alexa_top.csv"
This is considerably more efficient. You can measure the execution time using Measure-Command.
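For example, a rough timing comparison could look like this (a sketch; it assumes the MyCsv class defined above and reuses the same file path):
# Compare elapsed time of the plain import vs. the typed-class import.
Measure-Command { $plain = Import-Csv "C:\top-1m.csv" }
Measure-Command { $typed = Import-Csv "C:\top-1m.csv" | ForEach-Object { [MyCsv] $_ } }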
If you use Get-Content, it is very slow; the -Raw parameter improves the speed, but memory pressure gets high. The -ReadCount parameter sets how many lines are read per batch, which is even faster than using -Raw. The file can also be read using a switch statement, like:
switch -File "Path" {default {$_}}
That is faster still - but, regrettably, it uses even more memory.
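For reference, a sketch of those line-reading variants (the path is a placeholder, and the -ReadCount value of 1000 is an arbitrary batch size):
# Very slow: emits one string per line down the pipeline.
Get-Content "C:\top-1m.csv"
# -Raw is faster, but reads the entire file into memory as a single string.
Get-Content "C:\top-1m.csv" -Raw
# -ReadCount batches lines into arrays (here 1000 per batch), which is faster still.
Get-Content "C:\top-1m.csv" -ReadCount 1000
# switch -File is the fastest of these, but reportedly uses even more memory.
switch -File "C:\top-1m.csv" { default { $_ } }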