
How can I make this PowerShell script parse large files faster?

Tags: powershell

I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is ~30 MB. Larger files of around 200 MB are expected.

The script below works, but it takes a very long time to process even a 30 MB file.

PowerShell Script:

$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()

$content = gc $path\$infile |
    select -skip 4 |
    where {$_ -match "[|].*[|].*"} |
    foreach {$_ -replace "^[|]","" -replace "[|]$",""}

$header = $content[0]

$array = $content[0]
for ($i = 1; $i -le $content.length; $i+=1) {
    if ($array[$i] -ne $content[0]) {$array += $content[$i]}
}

$array | out-file $path\$outfile -encoding ASCII

DataFile Excerpt:

---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed |   93,118|
---------------------------
02/14/2012                                                                                                                                                           Production Operations and Confirmations                                                                                                                                                              2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber                       |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive|     WbsElement|SequenceNumber|OperationNumber|OperationDescription                    |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|              |0140           |Operation 1                             |          1 |               0 |              0.0 |                    |                499.990 |                      |        9908651250|
|180849518            |011255486L1                          |02/08/2012  |2101 |            9901123118|56B30     |I9902          |        |SOC10MA2302SOCJ31|14            |9916           |Operation 2                             |          1 |               0 |            499.0 |                    |                        |                      |        9908532289|
|181993564            |011255486L1                          |02/09/2012  |2101 |            9901288820|56B30     |I9902          |        |SOC10MD2302SOCJ31|14            |9916           |Operation 1                             |          1 |               0 |            499.0 |                    |                399.599 |                      |        9908498544|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|              |0150           |Operation 3                             |          1 |               0 |              0.0 |                    |                882.499 |                      |        9908099659|
|180885825            |011255486L1                          |02/08/2012  |2101 |            9901162239|56B30     |I9902          |        |SOC10MG2302SOCJ31|14            |9916           |Operation 4                             |          1 |               0 |            544.0 |                    |                        |                      |        9908858514|
|181638583            |990104460I0                          |02/10/2012  |2101 |            9902123289|56G99     |I9902          |        |SOC11MAR105SOCJ31|              |0160           |Operation 5                             |          1 |               0 |          1,160.0 |                    |                        |                      |        9914295010|
|181681218            |990104460B0                          |02/08/2012  |2101 |            9902180981|56G99     |I9902          |        |SOC11MAR328SOCJ31|0             |9910           |Operation 6                             |          1 |               0 |            916.0 |                    |                        |                      |        9914621885|
|181681036            |990104460I0                          |02/09/2012  |2101 |            9902180289|56G99     |I9902          |        |SOC11MAR108SOCJ31|              |0180           |Operation 8                             |          1 |               0 |              1.0 |                    |                        |                      |        9914619196|
|189938054            |011255486A2                          |02/10/2012  |2101 |            9999206805|5AD99     |I9902          |        |RS08MJ2305SOCJ31 |              |0599           |Operation 8                             |          1 |               0 |              0.0 |                    |                        |                      |        9901316289|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|0             |9935           |Operation 9                             |          1 |               0 |              0.5 |                    |                        |                      |        9916914233|
|181919894            |012984532A3                          |02/10/2012  |2101 |            9902511433|A199399Z  |I9902          |        |SOC12MCB101SOCJ31|22            |9951           |Operation 10                            |          1 |               0 |           68.080 |                    |                        |                      |        9916914224|
shawno asked Feb 24 '12 22:02




2 Answers

Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).

Try this (not tested extensively):

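$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"

# Number of lines Get-Content sends down the pipeline at a time
$batch = 1000

# Matches header and data rows (three or more pipes with text after the third)
[regex]$match_regex = '^\|.+\|.+\|.+'
# Captures everything between the leading and the trailing pipe
[regex]$replace_regex = '^\|(.+)\|$'

# -List stops at the first match, which is the column header row
$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -List).Line

[regex]$header_regex = [regex]::Escape($header_line)

# Write the header once, in the same encoding as the appends below
$header_line.Trim('|') | Set-Content $path\$outfile -Encoding ASCII

Get-Content $path\$infile -ReadCount $batch |
    ForEach-Object {
        # $_ is an array of up to $batch lines; each operator works on the whole array
        $_ -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1' |
            Out-File $path\$outfile -Append -Encoding ASCII
    }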

That's a compromise between memory usage and speed. The -match, -notmatch and -replace operators work on arrays, so you can filter and rewrite an entire batch at once without having to foreach through every record. -ReadCount makes Get-Content read the file in chunks of $batch lines, so you're reading 1000 lines at a time, doing the match and replace on that batch, appending the result to your output file, and then going back for the next 1000. Increasing $batch should speed it up at the cost of more memory. Adjust it to suit your resources.
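
To see the array behaviour in isolation, here is a minimal sketch that runs the same operator chain on a small in-memory batch. The sample lines and the $batch_lines variable are made up for illustration; they stand in for one chunk that Get-Content -ReadCount would hand over.

```powershell
# Hypothetical sample batch standing in for one -ReadCount chunk
$batch_lines = @(
    '|OrderNumber|Material|Plant|Qty|',
    '|-----------------------------|',
    '|180849518|011255486L1|2101|1|',
    'Production Operations and Confirmations',
    '|OrderNumber|Material|Plant|Qty|',
    '|181993564|011255486L1|2101|1|'
)

$match_regex   = '^\|.+\|.+\|.+'
$replace_regex = '^\|(.+)\|$'
# Treat the first line as the header and escape it so it matches literally
$header_regex  = [regex]::Escape($batch_lines[0])

# With an array on the left, each operator returns an array:
# -match keeps row-like lines, -notmatch drops repeated headers,
# -replace strips the outer pipes from every remaining line
$rows = $batch_lines -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1'

$rows
# 180849518|011255486L1|2101|1
# 181993564|011255486L1|2101|1
```

Note that the header is filtered out of every batch here, which is why the script writes it to the output file separately before the loop.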

mjolinor answered Oct 29 '22 08:10


The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:

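$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
try {
    # ReadLine() returns $null once the end of the file is reached
    while ($null -ne ($line = $r.ReadLine())) {
        # Process $line here...
    }
}
finally {
    # Release the file handle even if processing throws
    $r.Dispose()
}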
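
Applied to the file format in the question, the same loop could be paired with a StreamWriter so that nothing but the current line is held in memory. This is only a rough sketch: the function name Convert-PipeReport and its parameters are hypothetical, and it assumes the first row-like line is the column header and that repeated page headers are identical to it.

```powershell
# Hypothetical helper; the name and parameters are illustrative only
function Convert-PipeReport {
    param(
        [Parameter(Mandatory = $true)] [string] $InPath,
        [Parameter(Mandatory = $true)] [string] $OutPath
    )

    # Same row pattern as the other answer: three pipes with text after the third
    $row_regex = [regex]'^\|.+\|.+\|.+'
    $header = $null
    $reader = $null
    $writer = $null

    try {
        # .NET resolves relative paths against the process directory, so pass full paths
        $reader = [IO.File]::OpenText($InPath)
        $writer = New-Object IO.StreamWriter($OutPath, $false, [Text.Encoding]::ASCII)

        while ($null -ne ($line = $reader.ReadLine())) {
            # Skip titles, rule lines and the statistics box
            if (-not $row_regex.IsMatch($line)) { continue }

            if ($null -eq $header) {
                $header = $line              # first row-like line is the header
            }
            elseif ($line -ceq $header) {
                continue                     # drop repeated page headers
            }

            # Strip the leading and trailing pipe, then write the row
            $writer.WriteLine(($line -replace '^\|', '' -replace '\|$', ''))
        }
    }
    finally {
        if ($writer) { $writer.Dispose() }
        if ($reader) { $reader.Dispose() }
    }
}
```

Usage would look like Convert-PipeReport -InPath "$path\$infile" -OutPath "$path\$outfile", with $path, $infile and $outfile set as in the question.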

Some performance comparisons:

Measure-Command {Get-Content .\512MB.txt > $null}

Total Seconds: 49.4742533

Measure-Command {
    $r = [IO.File]::OpenText('512MB.txt')
    while ($r.Peek() -ge 0) {
        $r.ReadLine() > $null
    }
    $r.Dispose()
}

Total Seconds: 27.666803
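
The numbers above come from one machine and one file, so it is worth repeating the measurement on your own data. A small sketch for doing that follows; the sample file name, its contents and the line count are arbitrary assumptions, and the generated file is far smaller than 512 MB, so expect different absolute timings.

```powershell
# Build a throwaway sample file; name, contents and size are arbitrary assumptions
$sample = Join-Path ([IO.Path]::GetTempPath()) 'readtest-sample.txt'
$line = '|180849518|011255486L1|02/08/2012|2101|'
[IO.File]::WriteAllLines($sample, [string[]](1..20000 | ForEach-Object { $line }))

# Time Get-Content reading every line
$gc_time = Measure-Command { Get-Content $sample > $null }

# Time a StreamReader reading every line
$sr_time = Measure-Command {
    $r = [IO.File]::OpenText($sample)
    try { while ($null -ne $r.ReadLine()) { } }
    finally { $r.Dispose() }
}

'Get-Content : {0:N3} s' -f $gc_time.TotalSeconds
'StreamReader: {0:N3} s' -f $sr_time.TotalSeconds

# Clean up the sample file
Remove-Item $sample
```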

Andy Arismendi answered Oct 29 '22 06:10