Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bug in .net Regex.Replace?

Tags:

.net

regex

The following code...

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "A $1 B");

        Console.WriteLine(uc);
    }
}

.Net Fiddle Link

produces the following output...

A XYZ BA B

Do you think this is correct?

Shouldn't the output be...

A XYZ B

I think I am doing something stupid here. I would appreciate any help you can provide in helping me understand this issue.


Here is something interesting...

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        var r = new Regex("(.*)");
        var c = "XYZ";
        var uc = r.Replace(c, "$1");

        Console.WriteLine(uc);
    }
}

.Net Fiddle

Output...

XYZ

like image 999
Autodidact Avatar asked Jan 24 '14 14:01

Autodidact


1 Answers

As for why the engine returns 2 matches, it is due to the way .NET (also Perl and Java) handles global matching, i.e. find all matches to the given pattern in an input string.

The process can be described as followed (current index is usually set to 0 at the beginning of a search, unless specified):

  1. From the current index, perform a search.
  2. If there is no match:
    1. If current index already points at the end of the string (current index >= string.length), return the result so far.
    2. Increment current index by 1, go to step 1.
  3. If the main match ($0) is non-empty (at least one character is consumed), add the result and set current index to the end of main match ($0). Then go to step 1.
  4. If the main match ($0) is empty:
    1. If the previous match is non-empty, add the result and go to step 1.
    2. If the previous match is empty, backtrack and continue searching.
    3. If the backtracking attempt finds a non-empty match, add the result, set current index to the end of the match and go to step 1.
    4. Otherwise, increment current index by 1. Go to step 1.

The engine needs to check for empty match; otherwise, it will end up in an infinite loop. The designer recognizes the usage of empty match (in splitting a string into characters, for example), so the engine must be designed to avoid getting stuck at a certain position forever.

This process explains why there is an empty match at the end: since a search is conducted at the end of the string (index 3) after (.*) matches abc, and (.*) can match an empty string, an empty match is found. And the engine does not produce infinite number of empty matches, since an empty match has already been found at the end.

 a b c
^ ^ ^ ^
0 1 2 3

First match:

 a b c
^     ^
0-----3

Second match:

 a b c
      ^
      3

With the global matching algorithm above, there can only be at most 2 matches starting at the same index, and such case can only happen when the first one is an empty match.

Note that JavaScript simply increment current index by 1 if the main match is empty, so there is at most 1 match per index. However, in this case (.*), if you use global flag g to do global matching, the same result would happen:

(Result below is from Firefox, note the g flag)

> "XYZ".replace(/(.*)/g, "A $1 B")
"A XYZ BA  B"
like image 139
nhahtdh Avatar answered Nov 14 '22 05:11

nhahtdh