 

How to process a file in PowerShell line-by-line as a stream

I'm working with some multi-gigabyte text files and want to do some stream processing on them using PowerShell. It's simple stuff, just parsing each line and pulling out some data, then storing it in a database.

Unfortunately, get-content | %{ whatever($_) } appears to keep the entire set of lines at this stage of the pipe in memory. It's also surprisingly slow, taking a very long time to actually read it all in.

So my question is two parts:

  1. How can I make it process the stream line by line and not keep the entire thing buffered in memory? I would like to avoid using up several gigs of RAM for this purpose.
  2. How can I make it run faster? PowerShell iterating over a get-content appears to be 100x slower than a C# script.

I'm hoping there's something dumb I'm doing here, like missing a -LineBufferSize parameter or something...
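For reference, the pattern I'm currently using is roughly this (the file name and Process-Line are placeholders for my real parsing/database code):

Get-Content huge.log | ForEach-Object {
    # Process-Line stands in for my actual parse-and-store logic
    Process-Line $_
}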

asked Nov 16 '10 by scobi

People also ask

How do I read a text file line by line in PowerShell?

[System.IO.File]::ReadLines("C:\path\to\file.txt") | ForEach-Object { ... }. The foreach statement loads the entire collection into memory before iterating; ForEach-Object uses the pipeline to stream items as they arrive.
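A minimal sketch of that pattern (the path is hypothetical). ReadLines returns a lazy enumerable of strings, so lines flow through the pipeline one at a time:

[System.IO.File]::ReadLines("C:\path\to\file.txt") | ForEach-Object {
    # $_ is the current line; do your per-line work here
    $_.Length
}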

How do I pipe a file in PowerShell?

Use the Tee-Object cmdlet, which sends command output to a text file and then sends it to the pipeline. Use the PowerShell redirection operators. Using the redirection operator with a file target is functionally equivalent to piping to Out-File with no extra parameters.
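For example (the output file name is arbitrary), the first command writes the listing to a file while keeping the objects on the pipeline; the second ends the pipeline at the file, like piping to Out-File:

Get-ChildItem | Tee-Object -FilePath .\listing.txt | Select-Object -First 5
Get-ChildItem > .\listing.txt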

How do I read the contents of a file in PowerShell?

When you want to read the entire contents of a text file, the easiest way is to use the built-in Get-Content function. When you execute this command, the contents of this file will be displayed in your command prompt or the PowerShell ISE screen, depending on where you execute it.
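For instance (hypothetical path):

# Reads and displays the whole file; fine for small files,
# but see the answer below before doing this on multi-gigabyte logs
Get-Content "C:\path\to\file.txt"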


1 Answer

If you are really about to work on multi-gigabyte text files then do not use PowerShell. Even if you find a way to read it faster, processing a huge number of lines will be slow in PowerShell anyway, and you cannot avoid this. Even simple loops are expensive; say, for 10 million iterations (quite realistic in your case) we have:

# "empty" loop: takes 10 seconds measure-command { for($i=0; $i -lt 10000000; ++$i) {} }  # "simple" job, just output: takes 20 seconds measure-command { for($i=0; $i -lt 10000000; ++$i) { $i } }  # "more real job": 107 seconds measure-command { for($i=0; $i -lt 10000000; ++$i) { $i.ToString() -match '1' } } 

UPDATE: If you are still not scared, then try the .NET reader:

$reader = [System.IO.File]::OpenText("my.log")
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # process the line
        $line
    }
}
finally {
    $reader.Close()
}
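To connect this to the question's parse-and-store job, the "process the line" slot is where your code goes. A rough sketch follows; the regex is just an example, and Save-Batch is a hypothetical helper standing in for your database code (batching the inserts avoids one database round trip per line):

$reader = [System.IO.File]::OpenText("my.log")
$batch = New-Object System.Collections.Generic.List[string]
try {
    for() {
        $line = $reader.ReadLine()
        if ($line -eq $null) { break }
        # extract the field you care about; this pattern is only an example
        if ($line -match '^\d+\s+(?<id>\S+)') {
            $batch.Add($Matches['id'])
        }
        # flush in chunks; Save-Batch is a hypothetical stand-in for your DB code
        if ($batch.Count -ge 1000) {
            Save-Batch $batch
            $batch.Clear()
        }
    }
    if ($batch.Count) { Save-Batch $batch }
}
finally {
    $reader.Close()
}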

UPDATE 2

There are comments about possibly better / shorter code. There is nothing wrong with the original code using for, and it is not pseudo-code. But a shorter (shortest?) variant of the reading loop is:

$reader = [System.IO.File]::OpenText("my.log")
while($null -ne ($line = $reader.ReadLine())) {
    $line
}
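Another variant of about the same length, assuming your PowerShell host runs on .NET 4 or later where [System.IO.File]::ReadLines is available: it returns a lazy enumerable, so the foreach statement below still reads one line at a time rather than loading the file:

# streams lazily; note .NET resolves relative paths against the process
# working directory, so an absolute path is safer here
foreach ($line in [System.IO.File]::ReadLines("my.log")) {
    $line
}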
answered Oct 06 '22 by Roman Kuzmin