I have a problem using TRegEx.Replace:
var
  Value, Pattern, Replace: string;
begin
  Value := 'my_replace_string(4)=my_replace_string(5)';
  Pattern := 'my_replace_string\((\d+)\)';
  Replace := 'new_value(\1)';
  Value := TRegEx.Replace(Value, Pattern, Replace);
  ShowMessage(Value);
end;
The expected result would be new_value(4)=new_value(5), while my code (compiled with Delphi XE4) gives new_value(4)=new_value()1)
With Notepad++, I get the expected result.
Using a named group makes it clear that the stray 1 is the backreference being treated literally:
Pattern := 'my_replace_string\((?<name>\d+)\)';
Replace := 'new_value(${name})';
// Result: 'new_value(4)=new_value(){name})'
The replacement is always this simple (my_replace_string may occur zero or more times), so I could easily write a custom search-and-replace function, but I'd like to know what is going on here.
Is it my fault or is it a bug?
I can reproduce the bug in Delphi XE4. I get the correct behavior in Delphi XE5.
The bug was in TPerlRegEx.ComputeReplacement. The code that I contributed to Embarcadero for inclusion with Delphi XE3 used UTF8String. With Delphi XE4, Embarcadero eliminated UTF8String from the RegularExpressionsCore unit and replaced it with TBytes. The developer who made this change seems to have missed a crucial difference between strings and dynamic arrays in Delphi. Strings use a copy-on-write mechanism, while dynamic arrays do not.
So in my original code, TPerlRegEx.ComputeReplacement could do S := FReplacement and then modify the temporary variable S to substitute backreferences without affecting the FReplacement field, because both were strings. In the modified code, S := FReplacement makes S point to the same array as FReplacement, and when backreferences in S are substituted, FReplacement is also modified. Hence the first replacement is made correctly, while the following replacements are wrong because FReplacement was crippled.
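The difference can be seen in isolation with a minimal sketch. This is not the RegularExpressionsCore code: the program name and variables are made up for illustration, and Copy is shown only as one way to take an independent copy of a dynamic array.

```delphi
program AliasDemo; // hypothetical demo program, not part of the RTL
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  S1, S2: string;
  B1, B2, B3: TBytes;
begin
  // Strings are copy-on-write: writing to S2 gives it its own buffer.
  S1 := 'abc';
  S2 := S1;
  S2[1] := 'X';
  Writeln(S1);    // prints abc, S1 is untouched

  // Dynamic arrays are plain references: B2 and B1 share one array.
  B1 := TBytes.Create(1, 2, 3);
  B2 := B1;
  B2[0] := 99;
  Writeln(B1[0]); // prints 99, B1 changed through B2

  // An explicit copy restores the behavior the string version had.
  B3 := Copy(B1);
  B3[0] := 7;
  Writeln(B1[0]); // still prints 99
end.
```

The second case is the situation described above: once S aliases FReplacement, every substitution made in S is also made in the field that later matches rely on.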
In Delphi XE5 this was fixed by replacing S := FReplacement with this to make a proper temporary copy:
SetLength(S, Length(FReplacement));
Move(FReplacement[0], S[0], Length(FReplacement));
When Delphi 2009 was released, there was a lot of talk from Embarcadero that one shouldn't use string types to represent sequences of bytes. It seems they are now making the opposite mistake of using TBytes to represent strings.
The solution to this whole mess, which I have previously recommended to Embarcadero, is to switch to the new pcre16 functions which use UTF16LE just like Delphi strings. These functions did not exist when Delphi XE was released, but they do now and they should be used.
It would appear to be a bug. Here's my test program:
{$APPTYPE CONSOLE}
uses
  RegularExpressions;
var
  Value, Pattern, Replace: string;
begin
  Value := 'my_replace_string(4)=my_replace_string(5)';
  Pattern := 'my_replace_string\((\d+)\)';
  Replace := 'new_value(\1)';
  Value := TRegEx.Replace(Value, Pattern, Replace);
  Writeln(Value);
  Readln;
end.
On my XE3 the output is:
new_value(4)=new_value(5)
So it looks like the bug was introduced in XE4. I suggest that you submit a QC report. Use my SSCCE above because it is standalone.