PowerShell multiple string replacement efficiency

I'm trying to replace 600 different strings in a very large text file (30 MB+). I'm currently building a script that does this, following this question:

Script:

$string = gc $filePath 
$string | % {
    $_ -replace 'something0','somethingelse0' `
       -replace 'something1','somethingelse1' `
       -replace 'something2','somethingelse2' `
       -replace 'something3','somethingelse3' `
       -replace 'something4','somethingelse4' `
       -replace 'something5','somethingelse5' `
       ...
       (600 More Lines...)
       ...
}
$string | ac "C:\log.txt"

But since this checks each line 600 times, and the text file has well over 150,000 lines, it takes a lot of processing time.

Is there a more efficient alternative?

asked Jul 17 '13 14:07 by Richard


2 Answers

Combining the hash technique from Adi Inbar's answer, and the match evaluator from Keith Hill's answer to another recent question, here is how you can perform the replace in PowerShell:

# Build hashtable of search and replace values.
$replacements = @{
  'something0' = 'somethingelse0'
  'something1' = 'somethingelse1'
  'something2' = 'somethingelse2'
  'something3' = 'somethingelse3'
  'something4' = 'somethingelse4'
  'something5' = 'somethingelse5'
  'X:\Group_14\DACU' = '\\DACU$'
  '.*[^xyz]' = 'oO{xyz}'
  'moresomethings' = 'moresomethingelses'
}

# Join all (escaped) keys from the hashtable into one regular expression.
[regex]$r = @($replacements.Keys | foreach { [regex]::Escape( $_ ) }) -join '|'

[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
  # Return replacement value for each matched value.
  $matchedValue = $matchInfo.Groups[0].Value
  $replacements[$matchedValue]
}

# Perform replace over every line in the file and append to log.
Get-Content $filePath |
  foreach { $r.Replace( $_, $matchEval ) } |
  Add-Content 'C:\log.txt'
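
One thing to watch with a joined alternation: .NET tries the alternatives from left to right, and a hashtable does not guarantee key order, so if one search string is a prefix of another, the shorter one can win. The following is a minimal sketch, assuming that situation applies to your data. The sample keys and values are made up for illustration, and sorting by length is one possible workaround rather than part of the code above:

```powershell
# Hypothetical sample data: 'something1' is a prefix of 'something10'.
$replacements = @{
  'something1'  = 'first'
  'something10' = 'tenth'
}

# Put longer keys first so they are tried before their own prefixes.
$keys = $replacements.Keys | Sort-Object -Property Length -Descending
[regex]$r = @($keys | ForEach-Object { [regex]::Escape( $_ ) }) -join '|'

[scriptblock]$matchEval = { param( [Text.RegularExpressions.Match]$matchInfo )
  # Look up the replacement for whichever key matched.
  $replacements[$matchInfo.Value]
}

$result = $r.Replace( 'something10 and something1', $matchEval )
$result   # tenth and first
```

Without the sort, the output could come out as "first0 and first", depending on the order in which the hashtable happens to enumerate its keys.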
answered Sep 27 '22 16:09 by Emperor XLII


So, what you're saying is that you want to replace any of 600 strings in each of 150,000 lines, and you want to run one replace operation per line?

Yes, there is a way to do it, but not in PowerShell, at least I can't think of one. It can be done in Perl.


The Method:

  1. Construct a hash where the keys are the somethings and the values are the somethingelses.
  2. Join the keys of the hash with the | symbol, and use it as a match group in the regex.
  3. In the replacement, interpolate an expression that retrieves a value from the hash using the match variable for the capture group.

The Problem:

Frustratingly, PowerShell doesn't expose the match variables to expressions inside the replacement string. It doesn't work with the -replace operator, and it doesn't work with [regex]::Replace when the replacement is a string.

In Perl, you can do this, for example:

$string =~ s/(1|2|3)/@{[$1 + 5]}/g;

This will add 5 to the digits 1, 2, and 3 throughout the string, so if the string is "1224526123 [2] [6]", it turns into "6774576678 [7] [6]".

However, in PowerShell, both of these fail:

$string -replace '(1|2|3)',"$($1 + 5)"

[regex]::replace($string,'(1|2|3)',"$($1 + 5)")

In both cases, $1 evaluates to null, and the expression evaluates to plain old 5, because PowerShell expands the double-quoted string before the replace call ever sees it. The match variables in replacements are only meaningful in the resulting string, i.e. a single-quoted string or whatever the double-quoted string evaluates to. They're basically just backreferences that look like match variables. Sure, you can escape the $ before the number in a double-quoted string (with a backtick), so it will evaluate to the corresponding match group, but that defeats the purpose - it can't participate in an expression.
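
To make the evaluation order concrete, here is a small sketch using the same sample string. It assumes strict mode is off (so the unset variable $1 is simply null), and the variable names are only illustrative. The second half uses the script-block match evaluator technique from the other answer on this page, which is called once per match and so can compute with the matched text:

```powershell
$string = '1224526123 [2] [6]'

# "$($1 + 5)" is expanded by PowerShell before -replace is called.
# $1 is an unset variable here (assumes strict mode is off), so the
# replacement text is just "5".
$expanded = "$($1 + 5)"
$asString = $string -replace '(1|2|3)', $expanded

# A script block converted to a MatchEvaluator runs once per match,
# so it can do arithmetic on the matched digit.
$viaEvaluator = [regex]::Replace( $string, '(1|2|3)', {
  param( [Text.RegularExpressions.Match]$m )
  [string]([int]$m.Value + 5)
})

$asString       # 5554556555 [5] [6]
$viaEvaluator   # 6774576678 [7] [6]
```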


The Solution:

[This answer has been modified from the original. It has been formatted to fit match strings with regex metacharacters. And your TV screen, of course.]

If using another language is acceptable to you, the following Perl script works like a charm:

$filePath = $ARGV[0]; # Or hard-code it or whatever
open INPUT, '<', $filePath or die "Cannot open $filePath: $!";
open OUTPUT, '>', 'C:\log.txt' or die "Cannot open C:\\log.txt: $!";
%replacements = (
  'something0' => 'somethingelse0',
  'something1' => 'somethingelse1',
  'something2' => 'somethingelse2',
  'something3' => 'somethingelse3',
  'something4' => 'somethingelse4',
  'something5' => 'somethingelse5',
  'X:\Group_14\DACU' => '\\DACU$',
  '.*[^xyz]' => 'oO{xyz}',
  'moresomethings' => 'moresomethingelses'
);
foreach (keys %replacements) {
  push @strings, qr/\Q$_\E/;
  $replacements{$_} =~ s/\\/\\\\/g;
}
$pattern = join '|', @strings;
while (<INPUT>) {
  s/($pattern)/$replacements{$1}/g;
  print OUTPUT;
}
close INPUT;
close OUTPUT;

It searches for the keys of the hash (left of the =>), and replaces them with the corresponding values. Here's what's happening:

  • The foreach loop goes through all the keys of the %replacements hash and builds an array called @strings that contains them, with metacharacters quoted using \Q and \E, and the result of that quoted for use as a regex pattern (qr = quote regex). In the same pass, it escapes all the backslashes in the replacement strings by doubling them.
  • Next, the elements of the array are joined with |'s to form the search pattern. You could include the grouping parentheses in $pattern if you want, but I think this way makes it clearer what's happening.
  • The while loop reads each line from the input file, replaces any of the strings in the search pattern with the corresponding replacement strings in the hash, and writes the line to the output file.
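
For comparison, the same read, replace, write loop can be sketched in PowerShell with .NET stream classes and the match evaluator from the other answer. This is a rough sketch under stated assumptions, not a measured improvement: the temporary files stand in for $filePath and C:\log.txt, the three sample lines are made up, and the variable names are hypothetical.

```powershell
# Hypothetical temp files stand in for $filePath and C:\log.txt.
$inPath  = [IO.Path]::GetTempFileName()
$outPath = [IO.Path]::GetTempFileName()
[IO.File]::WriteAllLines( $inPath, [string[]]@('something0 here', 'nothing to do', 'X:\Group_14\DACU') )

$replacements = @{
  'something0'       = 'somethingelse0'
  'X:\Group_14\DACU' = '\\DACU$'
}

# One escaped alternation, as in the Perl $pattern.
[regex]$pattern = @($replacements.Keys | ForEach-Object { [regex]::Escape( $_ ) }) -join '|'
[scriptblock]$lookup = { param( [Text.RegularExpressions.Match]$m ) $replacements[$m.Value] }

$reader = New-Object IO.StreamReader $inPath
$writer = New-Object IO.StreamWriter $outPath
try {
  # Read one line at a time, like while (<INPUT>) in the Perl script.
  while ($null -ne ($line = $reader.ReadLine())) {
    $writer.WriteLine( $pattern.Replace( $line, $lookup ) )
  }
}
finally {
  # Close both files even if a replace throws.
  $reader.Dispose()
  $writer.Dispose()
}

$output = [IO.File]::ReadAllLines( $outPath )
```

Because the evaluator's return value is inserted literally, the backslashes in the replacement values need no doubling here.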

BTW, you might have noticed several other modifications from the original script. My Perl has collected some dust during my recent PowerShell kick, and on a second look I noticed several things that could be done better.

  • while (<INPUT>) reads the file one line at a time. A lot more sensible than reading the entire 150,000 lines into an array, especially when your goal is efficiency.
  • I simplified @{[$replacements{$1}]} to $replacements{$1}. Perl doesn't have a built-in way of interpolating expressions like PowerShell's $(), so @{[ ]} is used as a workaround - it builds an anonymous one-element array containing the expression and immediately dereferences it. But I realized that it's not necessary if the expression is just a single scalar variable (I had it in there as a holdover from my initial testing, where I was applying calculations to the $1 match variable).
  • The close statements aren't strictly necessary, but it's considered good practice to explicitly close your filehandles.
  • I changed the for abbreviation to foreach, to make it clearer and more familiar to PowerShell programmers.
answered Sep 27 '22 15:09 by Adi Inbar