I have 265 CSV files with over 4 million records (lines) in total, and I need to run a search and replace across all of them. The PowerShell snippet below does this, but it takes 17 minutes:
ForEach ($file in Get-ChildItem C:\temp\csv\*.csv)
{
    $content = Get-Content -path $file
    $content | foreach {$_ -replace $SearchStr, $ReplaceStr} | Set-Content $file
}
The following Python code does the same thing but takes less than 1 minute:
import os, fnmatch
def findReplace(directory, find, replace, filePattern):
    for path, dirs, files in os.walk(os.path.abspath(directory)):
        for filename in fnmatch.filter(files, filePattern):
            filepath = os.path.join(path, filename)
            with open(filepath) as f:
                s = f.read()
            s = s.replace(find, replace)
            with open(filepath, "w") as f:
                f.write(s)
findReplace("c:/temp/csv", "Search String", "Replace String", "*.csv")
Why is the Python method so much more efficient? Is my PowerShell code inefficient, or is Python just a more powerful programming language when it comes to text manipulation?
Give this PowerShell script a try. It should perform much better, and it should use much less RAM too, since the file is read through a buffered stream rather than loaded all at once.
$reader = [IO.File]::OpenText("C:\input.csv")
$writer = New-Object System.IO.StreamWriter("C:\output.csv")
while ($reader.Peek() -ge 0) {
    $line = $reader.ReadLine()
    $line2 = $line -replace $SearchStr, $ReplaceStr
    $writer.WriteLine($line2)
}
$reader.Close()
$writer.Close()
This processes one file, but you can test performance with it and, if it's acceptable, wrap it in a loop over your files.
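For comparison, here is a rough sketch of the same line-by-line idea applied to the Python code from the question, already wrapped in a loop over files. The function names `stream_replace` and `stream_replace_all` are made up for illustration, and the sketch assumes the search string never spans a line break and that the files use the platform's default encoding.

```python
import fnmatch
import os
import tempfile


def stream_replace(filepath, find, replace):
    """Rewrite one file line by line so only one line is held in memory."""
    directory = os.path.dirname(os.path.abspath(filepath))
    # Write to a temporary file in the same directory, then swap it in.
    with open(filepath, newline="") as src, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False, newline=""
    ) as dst:
        for line in src:
            # Literal replacement, as in the question's Python code.
            dst.write(line.replace(find, replace))
    # Both files are closed here, so the swap also works on Windows.
    os.replace(dst.name, filepath)


def stream_replace_all(directory, find, replace, file_pattern="*.csv"):
    """Apply stream_replace to every matching file directly in `directory`."""
    for filename in fnmatch.filter(os.listdir(directory), file_pattern):
        stream_replace(os.path.join(directory, filename), find, replace)
```

Something like `stream_replace_all("c:/temp/csv", "Search String", "Replace String")` would then cover the whole folder. Writing to a temporary file first means a crash mid-run should not leave a half-written CSV behind.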
Alternatively, you can use Get-Content with -ReadCount to read a number of lines into memory at a time, perform the replacement, and then write the updated chunk using the PowerShell pipeline.
Get-Content "C:\input.csv" -ReadCount 512 | % {
    $_ -replace $SearchStr, $ReplaceStr
} | Set-Content "C:\output.csv"
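The same batching idea can be sketched in Python, in case you want to compare the two languages on equal footing. This is only an illustration: `chunked_replace` and its `read_count` parameter are hypothetical names chosen to mirror -ReadCount, and the replacement here is literal rather than regex-based.

```python
from itertools import islice


def chunked_replace(src_path, dst_path, find, replace, read_count=512):
    """Copy src_path to dst_path, replacing text in batches of lines."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        while True:
            # Pull up to read_count lines from the file iterator.
            chunk = list(islice(src, read_count))
            if not chunk:
                break
            # Join the batch so one replace call covers all of its lines.
            dst.write("".join(chunk).replace(find, replace))
```

Memory use is bounded by the batch size instead of the file size, which is the trade-off -ReadCount makes as well.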
To squeeze out a little more performance, you can also compile the regex (-replace uses regular expressions) like this:
$re = New-Object Regex $SearchStr, 'Compiled'
$re.Replace( $_ , $ReplaceStr )
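One thing worth keeping in mind when comparing timings: since -replace is regex-based while the Python code in the question uses the literal `str.replace`, the two scripts are not doing quite the same work, and they can even produce different results if the search string contains regex metacharacters. The small sketch below shows the difference on the Python side with made-up sample data. On the PowerShell side, `[regex]::Escape($SearchStr)` should give you the literal behaviour.

```python
import re

line = "price,1.50,USD"

# Literal replacement, as in the question's Python code.
literal = line.replace("1.50", "2.00")

# Regex replacement: "." matches any character, so "1.50" also matches "1x50".
pattern = re.compile("1.50")
regex_hit = pattern.sub("2.00", "price,1x50,USD")

# Escaping the search string makes the regex behave like a literal match.
escaped = re.compile(re.escape("1.50"))
escaped_miss = escaped.sub("2.00", "price,1x50,USD")

print(literal)       # price,2.00,USD
print(regex_hit)     # price,2.00,USD
print(escaped_miss)  # price,1x50,USD
```

The precompiled pattern objects are reused across calls, which is the same general idea as building the Regex once up front rather than on every line.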
I see this a lot:
$content | foreach {$_ -replace $SearchStr, $ReplaceStr}
The -replace operator will handle an entire array at once:
$content -replace $SearchStr, $ReplaceStr
and do it a lot faster than iterating through one element at a time. I suspect doing that may get you closer to an apples-to-apples comparison.