
Delphi TRegEx backreference broken?

I have a problem using TRegEx.Replace:

var
  Value, Pattern, Replace: string;
begin
  Value   := 'my_replace_string(4)=my_replace_string(5)';
  Pattern := 'my_replace_string\((\d+)\)';
  Replace := 'new_value(\1)';
  Value   := TRegEx.Replace(Value, Pattern, Replace);
  ShowMessage(Value);
end;

The expected result is new_value(4)=new_value(5), but my code (compiled with Delphi XE4) produces new_value(4)=new_value()1)

With Notepad++, I get the expected result.

Using a named group makes it clear that the stray 1 is what remains of the backreference, which is treated as literal text in the second replacement:

Pattern := 'my_replace_string\((?<name>\d+)\)';
Replace := 'new_value(${name})';
// Result: 'new_value(4)=new_value(){name})'

The replacement is always that simple (my_replace_string may occur zero or more times), so I could easily write a custom search-and-replace function, but I'd like to know what is going on here.

Is it my fault or is it a bug?

Pharaoh asked Jan 07 '14 11:01



2 Answers

I can reproduce the bug in Delphi XE4. I get the correct behavior in Delphi XE5.

The bug was in TPerlRegEx.ComputeReplacement. The code that I contributed to Embarcadero for inclusion with Delphi XE3 used UTF8String. With Delphi XE4, Embarcadero eliminated UTF8String from the RegularExpressionsCore unit and replaced it with TBytes. The developer who made this change seems to have missed a crucial difference between strings and dynamic arrays in Delphi: strings use a copy-on-write mechanism, while dynamic arrays do not.

So in my original code, TPerlRegEx.ComputeReplacement could do S := FReplacement and then modify the temporary variable S to substitute backreferences without affecting the FReplacement field, because both were strings. In the modified code, S := FReplacement makes S point to the same array as FReplacement, so when backreferences in S are substituted, FReplacement is modified as well. Hence the first replacement is made correctly, while subsequent replacements are wrong because FReplacement has already been overwritten.
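The difference between the two assignment semantics can be seen in isolation. The following is a minimal sketch, not the RegularExpressionsCore code: the program name and the variables S1, S2, B1 and B2 are made up for illustration, and it assumes a Delphi version with desktop (1-based, mutable) strings such as XE4.

```delphi
program AliasDemo;

{$APPTYPE CONSOLE}

uses
  System.SysUtils; // declares TBytes

var
  S1, S2: string;
  B1, B2: TBytes;
begin
  // Strings: assignment shares the data, but the first write to S2
  // triggers copy-on-write, so S1 keeps its original contents.
  S1 := 'abc';
  S2 := S1;
  S2[1] := 'X';
  Writeln(S1);    // abc

  // Dynamic arrays: assignment only copies the reference, so a write
  // through B2 is visible through B1 as well.
  B1 := TBytes.Create(1, 2, 3);
  B2 := B1;
  B2[0] := 99;
  Writeln(B1[0]); // 99

  // Copy creates an independent array; B1 is no longer affected.
  B2 := Copy(B1);
  B2[0] := 7;
  Writeln(B1[0]); // 99

  Readln;
end.
```

The middle case is the situation described above: S and FReplacement refer to one and the same array, so editing the "temporary" also edits the field.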

In Delphi XE5 this was fixed by replacing S := FReplacement with this to make a proper temporary copy:

SetLength(S, Length(FReplacement));
Move(FReplacement[0], S[0], Length(FReplacement));

When Delphi 2009 was released there was a lot of talk from Embarcadero that one shouldn't use string types to represent sequences of bytes. It seems they are now making the opposite mistake of using TBytes to represent strings.

The solution to this whole mess, which I have previously recommended to Embarcadero, is to switch to the new pcre16 functions which use UTF16LE just like Delphi strings. These functions did not exist when Delphi XE was released, but they do now and they should be used.

Jan Goyvaerts answered Oct 04 '22 02:10



It would appear to be a bug. Here's my test program:

{$APPTYPE CONSOLE}

uses
  RegularExpressions;

var
  Value, Pattern, Replace: string;
begin
  Value   := 'my_replace_string(4)=my_replace_string(5)';
  Pattern := 'my_replace_string\((\d+)\)';
  Replace := 'new_value(\1)';
  Value   := TRegEx.Replace(Value, Pattern, Replace);
  Writeln(Value);
  Readln;
end.

On my XE3 the output is:

new_value(4)=new_value(5)

So it looks like the bug was introduced in XE4. I suggest that you submit a QC report. Use my SSCCE above because it is standalone.
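For anyone who has to stay on XE4 in the meantime, the custom search-and-replace mentioned in the question could be built on TRegEx.Matches, so that no replacement template is involved at all. This is only a sketch and has not been tested on XE4: ReplaceCalls and NextPos are hypothetical names, and it assumes that matching itself is unaffected by the bug and that TMatch.Index is 1-based, as it is for desktop compilers.

```delphi
program ManualReplace;

{$APPTYPE CONSOLE}

uses
  System.RegularExpressions;

// Hypothetical helper: rebuilds the string by hand instead of passing
// a replacement template with backreferences to TRegEx.Replace.
function ReplaceCalls(const Input: string): string;
var
  Match: TMatch;
  NextPos: Integer; // first character not yet copied to Result
begin
  Result := '';
  NextPos := 1;
  for Match in TRegEx.Matches(Input, 'my_replace_string\((\d+)\)') do
  begin
    // Copy the text between the previous match and this one, then
    // append the rewritten call using capture group 1.
    Result := Result + Copy(Input, NextPos, Match.Index - NextPos)
      + 'new_value(' + Match.Groups[1].Value + ')';
    NextPos := Match.Index + Match.Length;
  end;
  // Append whatever follows the last match (or the whole input if
  // there was no match).
  Result := Result + Copy(Input, NextPos, MaxInt);
end;

begin
  Writeln(ReplaceCalls('my_replace_string(4)=my_replace_string(5)'));
  Readln;
end.
```

Because the replacement text is assembled from the captured group directly, the code path through TPerlRegEx.ComputeReplacement's backreference substitution should not come into play.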

David Heffernan answered Oct 04 '22 02:10
