I have the following PowerShell script that parses some very large files for ETL purposes. For starters, my test file is ~30 MB. Larger files around 200 MB are expected. So I have a few questions.
The script below works, but it takes a very long time to process even a 30 MB file.
$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$array = @()
$content = gc $path\$infile |
select -skip 4 |
where {$_ -match "[|].*[|].*"} |
foreach {$_ -replace "^[|]","" -replace "[|]$",""}
$header = $content[0]
$array = $content[0]
for ($i = 1; $i -le $content.length; $i+=1) {
if ($array[$i] -ne $content[0]) {$array += $content[$i]}
}
$array | out-file $path\$outfile -encoding ASCII
Here is a sample of the file being parsed:
---------------------------
|Data statistics|Number of|
|-------------------------|
|Records passed | 93,118|
---------------------------
02/14/2012 Production Operations and Confirmations 2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Production Operations and Confirmations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|ProductionOrderNumber|MaterialNumber |ModifiedDate|Plant|OperationRoutingNumber|WorkCenter|OperationStatus|IsActive| WbsElement|SequenceNumber|OperationNumber|OperationDescription |OperationQty|ConfirmedYieldQty|StandardValueLabor|ActualDirectLaborHrs|ActualContractorLaborHrs|ActualOvertimeLaborHrs|ConfirmationNumber|
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31| |0140 |Operation 1 | 1 | 0 | 0.0 | | 499.990 | | 9908651250|
|180849518 |011255486L1 |02/08/2012 |2101 | 9901123118|56B30 |I9902 | |SOC10MA2302SOCJ31|14 |9916 |Operation 2 | 1 | 0 | 499.0 | | | | 9908532289|
|181993564 |011255486L1 |02/09/2012 |2101 | 9901288820|56B30 |I9902 | |SOC10MD2302SOCJ31|14 |9916 |Operation 1 | 1 | 0 | 499.0 | | 399.599 | | 9908498544|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31| |0150 |Operation 3 | 1 | 0 | 0.0 | | 882.499 | | 9908099659|
|180885825 |011255486L1 |02/08/2012 |2101 | 9901162239|56B30 |I9902 | |SOC10MG2302SOCJ31|14 |9916 |Operation 4 | 1 | 0 | 544.0 | | | | 9908858514|
|181638583 |990104460I0 |02/10/2012 |2101 | 9902123289|56G99 |I9902 | |SOC11MAR105SOCJ31| |0160 |Operation 5 | 1 | 0 | 1,160.0 | | | | 9914295010|
|181681218 |990104460B0 |02/08/2012 |2101 | 9902180981|56G99 |I9902 | |SOC11MAR328SOCJ31|0 |9910 |Operation 6 | 1 | 0 | 916.0 | | | | 9914621885|
|181681036 |990104460I0 |02/09/2012 |2101 | 9902180289|56G99 |I9902 | |SOC11MAR108SOCJ31| |0180 |Operation 8 | 1 | 0 | 1.0 | | | | 9914619196|
|189938054 |011255486A2 |02/10/2012 |2101 | 9999206805|5AD99 |I9902 | |RS08MJ2305SOCJ31 | |0599 |Operation 8 | 1 | 0 | 0.0 | | | | 9901316289|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|0 |9935 |Operation 9 | 1 | 0 | 0.5 | | | | 9916914233|
|181919894 |012984532A3 |02/10/2012 |2101 | 9902511433|A199399Z |I9902 | |SOC12MCB101SOCJ31|22 |9951 |Operation 10 | 1 | 0 | 68.080 | | | | 9916914224|
Your script reads one line at a time (slow!) and stores almost the entire file in memory (big!).
Try this (not tested extensively):
$path = "E:\Documents\Projects\ESPS\Dev\DataFiles\DimProductionOrderOperation"
$infile = "14SEP11_ProdOrderOperations.txt"
$outfile = "PROCESSED_14SEP11_ProdOrderOperations.txt"
$batch = 1000
[regex]$match_regex = '^\|.+\|.+\|.+'
[regex]$replace_regex = '^\|(.+)\|$'
$header_line = (Select-String -Path $path\$infile -Pattern $match_regex -list).line
[regex]$header_regex = [regex]::escape($header_line)
$header_line.trim('|') | Set-Content $path\$outfile
Get-Content $path\$infile -ReadCount $batch |
    ForEach {
        $_ -match $match_regex -notmatch $header_regex -replace $replace_regex, '$1' |
            Out-File $path\$outfile -Append
    }
That's a compromise between memory usage and speed. The -match and -replace operators work on an array, so you can filter and replace an entire array at once without having to foreach through every record. The -ReadCount parameter causes the file to be read in chunks of $batch records, so you're basically reading in 1000 records at a time, doing the match and replace on that batch, then appending the result to your output file before going back for the next 1000 records. Increasing the size of $batch should speed it up, but it will make it use more memory. Adjust that to suit your resources.
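As a quick illustration of those array-wide operators (the sample strings below are made up, not taken from the report file):

# Made-up sample lines, just to show that -match and -replace act on whole arrays
$lines = '|A|B|C|', 'page footer', '|D|E|F|'
$lines -match '^\|.+\|.+\|.+'        # returns only '|A|B|C|' and '|D|E|F|'
$lines -replace '^\|(.+)\|$', '$1'   # strips the outer pipes: 'A|B|C', 'page footer', 'D|E|F'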
The Get-Content cmdlet does not perform as well as a StreamReader when dealing with very large files. You can read a file line by line using a StreamReader like this:
$path = 'C:\A-Very-Large-File.txt'
$r = [IO.File]::OpenText($path)
while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()
    # Process $line here...
}
$r.Dispose()
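Applied to the report file from the question, a rough sketch (untested, reusing the $path/$infile/$outfile variables and the same header-row regex as the batch answer above) might look like this:

# Untested sketch: stream the report with a StreamReader, keep only header/data rows,
# drop repeated page headers, strip the outer pipes, and write as you go.
# $path, $infile, and $outfile are the same variables defined in the original script.
$reader = [IO.File]::OpenText("$path\$infile")
$writer = New-Object IO.StreamWriter("$path\$outfile", $false, [Text.Encoding]::ASCII)
$header = $null
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    if ($line -notmatch '^\|.+\|.+\|.+') { continue }   # skip banners and separator rows
    if ($header -eq $null) {
        $header = $line                                  # first matching row is the header
    } elseif ($line -eq $header) {
        continue                                         # drop repeated page headers
    }
    $writer.WriteLine($line.Trim('|'))                   # strip leading/trailing pipes
}
$writer.Dispose()
$reader.Dispose()

This writes each line as it is read, so memory use stays flat regardless of file size.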
Some performance comparisons:
Measure-Command {Get-Content .\512MB.txt > $null}
Total Seconds: 49.4742533
Measure-Command {
    $r = [IO.File]::OpenText('512MB.txt')
    while ($r.Peek() -ge 0) {
        $r.ReadLine() > $null
    }
    $r.Dispose()
}
Total Seconds: 27.666803
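If you want to see where the batched Get-Content approach lands, you could time it the same way (a hypothetical comparison, not part of the measurements above):

# Hypothetical: time Get-Content with -ReadCount on the same file for comparison
Measure-Command {Get-Content .\512MB.txt -ReadCount 1000 > $null}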