 

PowerShell: Importing 16MB CSV Into PowerShell Variable Creates >600MB of PowerShell Memory Usage

I'm trying to understand why PowerShell's memory balloons so much when I import a ~16MB file into a variable. I can understand there's additional memory structure around that variable, but I'm just trying to understand why it's THAT high. Here's what I'm doing below - just a stripped-down, simplified snippet of another script that anyone can run.

Notes/Questions

  1. Not complaining - just trying to understand why the usage is so high, and whether there's a better way of doing this or of managing memory more efficiently to respect the system I'm running this on.
  2. This same behavior occurs in PowerShell 5.1 and in PowerShell 7 (RC3 just released). I don't think this is a bug, simply another opportunity for me to learn more.
  3. My overall objective with this is to run a foreach loop to check another much smaller array against this array for matches or lack thereof (sketched after the test code below).

My Test Code

Invoke-WebRequest -Uri "http://s3.amazonaws.com/alexa-static/top-1m.csv.zip" -OutFile C:\top-1m.csv.zip

# Extract to C:\ so the CSV ends up at C:\top-1m.csv
Expand-Archive -Path C:\top-1m.csv.zip -DestinationPath C:\

$alexaTopMillion = Import-Csv -Path C:\top-1m.csv
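
For context, the matching step I have in mind looks roughly like this ($mySmallerList and the .Domain property name are just placeholders, not part of the actual script):

# Rough sketch of the matching step described in note 3 above.
# $mySmallerList and the .Domain property name are placeholders.
foreach ($entry in $mySmallerList) {
    if ($alexaTopMillion.Domain -contains $entry) {
        "match:    $entry"
    }
    else {
        "no match: $entry"
    }
}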

To anyone answering this: Thank you for your time and helping me learn more every day!

asked Feb 22 '20 by Dave Spatz



2 Answers

Generally speaking, iRon's advice in a comment on the question is worth heeding (the specific question is addressed in the section that follows this one):

To keep memory usage low, use streaming of objects in the pipeline rather than collecting them in memory first - if feasible.

That is, instead of doing this:

# !! Collects ALL objects in memory, as an array.
$rows = Import-Csv in.csv
foreach ($row in $rows) { ... }

do this:

# Process objects ONE BY ONE.
# As long as you stream to a *file* or some other output stream
# (as opposed to assigning to a *variable*), memory use should remain constant,
# except for temporarily held memory awaiting garbage collection.
Import-Csv in.csv | ForEach-Object { ... } # pipe to Export-Csv, for instance

However, even then you seemingly can run out of memory with very large files - see this question - possibly related to buildup of memory from no longer needed objects that haven't yet been garbage-collected; therefore, periodically calling [GC]::Collect() in the ForEach-Object script block may solve the problem - see this answer for an example.
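
A rough sketch of that periodic-collection approach (the 100,000-row threshold is arbitrary - adjust it to your data):

# Sketch only: force a garbage collection every 100,000 rows.
$i = 0
Import-Csv in.csv | ForEach-Object {
  # ... process the row ...
  $i++
  if ($i % 100000 -eq 0) { [GC]::Collect() }
} # pipe to Export-Csv, for instance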


If you do need to collect all objects output by Import-Csv in memory at once:

The inordinate memory use you observe comes from how [pscustomobject] instances (Import-Csv's output type) are implemented, as discussed in GitHub issue #7603 (emphasis added):

The memory pressure most likely comes from the cost of PSNoteProperty [which is how [pscustomobject] properties are implemented]. Each PSNoteProperty has an overhead of 48 bytes, so when you just store a few bytes per property, that becomes massive.

The same issue proposes a workaround to reduce memory consumption (as also shown in Wasif Hasan's answer):

  • Read the first CSV row and dynamically create a custom class that represents the rows, using Invoke-Expression.

    • Note: While its use is safe here, Invoke-Expression should generally be avoided.

    • If you know the column structure in advance, you can create a custom class the regular way (see the sketch after the code below), which also allows you to use appropriate data types for the properties (which are otherwise all strings by default); for instance, defining appropriate properties as [int] (System.Int32) further reduces memory consumption.

  • Pipe Import-Csv to a ForEach-Object call that converts each [pscustomobject] created to an instance of the dynamically created class, which stores the data more efficiently.

Note: This workaround comes at the expense of considerably diminished execution speed.

$csvFile = 'C:\top-1m.csv'

# Dynamically define a custom class derived from the *first* row
# read from the CSV file.
# Note: While this is a legitimate use of Invoke-Expression, 
#       it should generally be avoided.
"class CsvRow { 
 $((Import-Csv $csvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";") 
}" | Invoke-Expression

# Import all rows and convert them from [pscustomobject] instances 
# to [CsvRow] instances to reduce memory consumption.
# Note: Casting the Import-Csv call directly to [CsvRow[]] would be noticeably
#       faster, but increases *temporary* memory pressure substantially.
$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRow] $_ }
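
By contrast, if you know the columns ahead of time, you can write the class by hand with appropriately typed properties. The Rank and Domain column names below are assumptions for illustration only - the cast works only if they match the CSV's actual column names:

# Hand-written equivalent with assumed column names; typing Rank as [int]
# stores the value more compactly than a string.
class CsvRowTyped {
  [int]    $Rank
  [string] $Domain
}

$alexaTopMillion = Import-Csv $csvFile | ForEach-Object { [CsvRowTyped] $_ }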

Longer-term, a better solution that would also be faster is to make Import-Csv support outputting parsed rows with a given output type, say, via an -OutputType parameter, as proposed in GitHub issue #8862.
If that is of interest to you, show your support for the proposal there.


Memory-use benchmarks:

The following code compares the memory use of a normal Import-Csv import (an array of [pscustomobject] instances) with that of the workaround (an array of custom-class instances).

The measurement isn't exact, as the PowerShell process's working-set memory is simply queried, which can reflect background activity, but it gives a rough sense of how much less memory using a custom class requires.

Sample output, showing that the custom-class workaround requires only about one fifth of the memory with the sample 10-column, ~166,000-row CSV input file used below - the specific ratio depends on the number of input rows and columns:

MB Used Command
------- -------
 384.50  # normal import…
  80.48  # import via custom class…

Benchmark code:

# Create a sample CSV file with 10 columns about 16 MB in size.
$tempCsvFile = [IO.Path]::GetTempFileName()
('"Col1","Col2","Col3","Col4","Col5","Col6","Col7","Col8","Col9","Col10"' + "`n") | Set-Content -NoNewline $tempCsvFile
('"Col1Val","Col2Val","Col3Val","Col4Val","Col5Val","Col6Val","Col7Val","Col8Val","Col9Val","Col10Val"' + "`n") * 1.662e5 |
  Add-Content $tempCsvFile

try {

  { # normal import
    $all = Import-Csv $tempCsvFile
  },
  { # import via custom class
    "class CsvRow {
      $((Import-Csv $tempCsvFile | Select-Object -first 1).psobject.properties.Name -replace '^', '[string] $$' -join ";")
    }" | Invoke-Expression
    $all = Import-Csv $tempCsvFile | ForEach-Object { [CsvRow] $_ }
  } | ForEach-Object {
    [gc]::Collect(); [gc]::WaitForPendingFinalizers() # garbage-collect first.
    Start-Sleep 2 # Wait a little for the freed memory to be reflected in the process object.
    $before = (Get-Process -Id $PID).WorkingSet64
    # Execute the command.
    & $_
    # Measure memory consumption and output the result.
    [pscustomobject] @{
      'MB Used' = ('{0,4:N2}' -f (((Get-Process -Id $PID).WorkingSet64 - $before) / 1mb)).PadLeft(7)
      Command = $_
    }
  }

} finally {
  Remove-Item $tempCsvFile
}
answered Nov 15 '22 by mklement0


You can generate a type for each item as described here: https://github.com/PowerShell/PowerShell/issues/7603

Import-Csv "C:\top-1m.csv" | Select-Object -first 1 | ForEach {$_.psobject.properties.name} | Join-String -Separator "`r`n" -OutputPrefix "class MyCsv {`r`n" -OutputSuffix "`n}" -Property {"`t`$$_"}  | Invoke-Expression
Import-Csv "C:\top-1m.csv" | Foreach {[MyCsv]$_} | Export-Csv "C:\alexa_top.csv"

This is considerably more efficient. You can measure the time using Measure-Command (see the example below).
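
For example, something along these lines (assuming the MyCsv class from the snippet above has already been defined):

# Compare wall-clock times of the plain import and the typed import.
(Measure-Command { Import-Csv "C:\top-1m.csv" }).TotalSeconds
(Measure-Command { Import-Csv "C:\top-1m.csv" | ForEach-Object { [MyCsv] $_ } }).TotalSeconds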

If you use Get-Content instead, it's very slow. The -Raw parameter improves the speed, but memory pressure gets high.

The -ReadCount parameter sets how many lines are sent through the pipeline at a time; this is even faster than using -Raw.
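
A rough sketch of the -ReadCount approach (the batch size of 1000 is arbitrary):

# Each pipeline object is an array of up to 1000 lines.
Get-Content "C:\top-1m.csv" -ReadCount 1000 | ForEach-Object {
  foreach ($line in $_) {
    # ... process $line ...
  }
}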

The file can even be read using a switch statement:

Switch -File "Path" {default {$_}}

That's even faster still, but unfortunately it uses even more memory.

answered Nov 15 '22 by programmer365