
Confusion about the behavior of capturing groups in Java regex

In this answer I recommended using

s.replaceFirst("\\.0*$|(\\.\\d*?)0+$", "$1");

but two people complained that the result contained the string "null", e.g., 23.null. This could be explained by $1 (i.e., group(1)) being null, which could be transformed via String.valueOf to the string "null". However, I always get the empty string. My testcase covers it and

assertEquals("23", removeTrailingZeros("23.00"));

passes. Is the exact behavior undefined?

maaartinus asked Dec 12 '14 08:12


3 Answers

The documentation of the Matcher class in the reference implementation doesn't specify how appendReplacement behaves when the replacement string refers to a capturing group that didn't capture anything (i.e., one for which group returns null). The behavior of the group method is clearly specified, but nothing is said about this case for appendReplacement.

Below are three implementations that differ in how they handle this case:

  • The reference implementation appends nothing (equivalently, an empty string).
  • GNU Classpath's and Android's implementations append the string "null".

Some code has been omitted for the sake of brevity, and is indicated by ....

1) Sun/Oracle JDK, OpenJDK (Reference implementation)

For the reference implementation (Sun/Oracle JDK and OpenJDK), the code for appendReplacement doesn't seem to have changed since Java 6, and it appends nothing when a capturing group doesn't capture anything:

        } else if (nextChar == '$') {
            // Skip past $
            cursor++;
            // The first number is always a group
            int refNum = (int)replacement.charAt(cursor) - '0';
            if ((refNum < 0)||(refNum > 9))
                throw new IllegalArgumentException(
                    "Illegal group reference");
            cursor++;

            // Capture the largest legal group string
            ...

            // Append group
            if (start(refNum) != -1 && end(refNum) != -1)
                result.append(text, start(refNum), end(refNum));
        } else {

References

  • jdk6/98e143b44620
  • jdk8/687fd7c7986d

2) GNU Classpath

GNU Classpath, which is a complete reimplementation of the Java Class Library, has a different implementation of appendReplacement for this case. In Classpath, the classes in the java.util.regex package are just wrappers around the classes in gnu.java.util.regex.

Matcher.appendReplacement calls RE.getReplacement to process replacement for the matched portion:

  public Matcher appendReplacement (StringBuffer sb, String replacement)
    throws IllegalStateException
  {
    assertMatchOp();
    sb.append(input.subSequence(appendPosition,
                                match.getStartIndex()).toString());
    sb.append(RE.getReplacement(replacement, match,
        RE.REG_REPLACE_USE_BACKSLASHESCAPE));
    appendPosition = match.getEndIndex();
    return this;
  }

RE.getReplacement calls REMatch.substituteInto to get the content of the capturing group and appends its result directly:

                  case '$':
                    int i1 = i + 1;
                    while (i1 < replace.length () &&
                           Character.isDigit (replace.charAt (i1)))
                      i1++;
                    sb.append (m.substituteInto (replace.substring (i, i1)));
                    i = i1 - 1;
                    break;

REMatch.substituteInto appends the result of REMatch.toString(int) directly without checking whether the capturing group has captured anything:

        if ((input.charAt (pos) == '$')
            && (Character.isDigit (input.charAt (pos + 1))))
          {
            // Omitted code parses the group number into val
            ...

            if (val < start.length)
              {
                output.append (toString (val));
              }
          }

And REMatch.toString(int) returns null when the capturing group doesn't capture (irrelevant code has been omitted).

  public String toString (int sub)
  {
    if ((sub >= start.length) || sub < 0)
      throw new IndexOutOfBoundsException ("No group " + sub);
    if (start[sub] == -1)
      return null;
    ...
  }

So in GNU Classpath's case, the string "null" is appended to the result when the replacement string refers to a capturing group that failed to capture anything.
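The step from a null return value to the literal text "null" comes from StringBuffer itself: append(String) is documented to append the four characters "null" when given a null argument. A minimal sketch of that effect, independent of any regex library (the class and method names are only illustrative, not taken from Classpath):

```java
class NullAppendDemo {
    // Mimics the tail end of a replacement routine that appends a group's
    // text without first checking whether the group took part in the match.
    static String appendGroup(String prefix, String groupText) {
        StringBuffer sb = new StringBuffer(prefix);
        sb.append(groupText); // a null String is appended as the text "null"
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(appendGroup("23", ".5")); // 23.5
        System.out.println(appendGroup("23", null)); // 23null
    }
}
```

A guard like the one in the reference implementation (skip the append when the group did not match) is all that separates the two behaviors.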

3) Android Open Source Project - Java Core Libraries

In Android, Matcher.appendReplacement calls the private method appendEvaluated, which in turn appends the result of group(int) directly to the buffer:

public Matcher appendReplacement(StringBuffer buffer, String replacement) {
    buffer.append(input.substring(appendPos, start()));
    appendEvaluated(buffer, replacement);
    appendPos = end();
    return this;
}

private void appendEvaluated(StringBuffer buffer, String s) {
    boolean escape = false;
    boolean dollar = false;
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (c == '\\' && !escape) {
            escape = true;
        } else if (c == '$' && !escape) {
            dollar = true;
        } else if (c >= '0' && c <= '9' && dollar) {
            buffer.append(group(c - '0'));
            dollar = false;
        } else {
            buffer.append(c);
            dollar = false;
            escape = false;
        }
    }
    // This seemingly stupid piece of code reproduces a JDK bug.
    if (escape) {
        throw new ArrayIndexOutOfBoundsException(s.length());
    }
}

Since Matcher.group(int) returns null for a capturing group that fails to capture, Matcher.appendReplacement appends the string "null" when the replacement string refers to such a group.

It is most likely that the 2 people complaining to you are running their code on Android.
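If the code has to behave the same on all three implementations, one option is to avoid "$1" in the replacement string and handle the null case by hand. The following is only a sketch under that assumption; the class name TrailingZeros is made up for illustration, while the method name and the pattern are the ones from the question:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingZeros {
    // Same pattern as in the question.
    private static final Pattern TRAILING =
        Pattern.compile("\\.0*$|(\\.\\d*?)0+$");

    static String removeTrailingZeros(String s) {
        Matcher m = TRAILING.matcher(s);
        if (!m.find()) {
            return s; // nothing to strip
        }
        // group(1) is null when the first alternative matched.
        String kept = m.group(1);
        return s.substring(0, m.start())
            + (kept == null ? "" : kept)
            + s.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00")); // 23
        System.out.println(removeTrailingZeros("23.10")); // 23.1
    }
}
```

This relies only on find, group, start and end, whose handling of an unmatched group is specified, so it does not depend on how appendReplacement treats null.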

nhahtdh answered Oct 30 '22 12:10


Having had a careful look at the Javadoc, I conclude that:

  1. $1 is equivalent to calling group(1), which is specified to return null when the group didn't get captured.
  2. The handling of nulls in the replacement expression is unspecified.
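The first point can be checked directly with the pattern from the question. This is a small sketch (the class and helper names are illustrative only) that inspects group 1 after a match in which only the first alternative took part:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnmatchedGroupDemo {
    private static final Pattern P =
        Pattern.compile("\\.0*$|(\\.\\d*?)0+$");

    // Text captured by group 1 in the first match, or null if the group
    // did not take part in that match.
    static String group1(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) throw new IllegalArgumentException("no match: " + input);
        return m.group(1);
    }

    // Start index of group 1 in the first match, or -1 if the group
    // did not take part in that match.
    static int start1(String input) {
        Matcher m = P.matcher(input);
        if (!m.find()) throw new IllegalArgumentException("no match: " + input);
        return m.start(1);
    }

    public static void main(String[] args) {
        System.out.println(group1("23.00")); // null
        System.out.println(start1("23.00")); // -1
        System.out.println(group1("23.10")); // .1
    }
}
```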

The wording of the relevant parts of the Javadoc is on the whole surprisingly vague (emphasis mine):

Dollar signs may be treated as references to captured subsequences as described above...

NPE answered Oct 30 '22 13:10


You have two alternatives joined by |, but only the second is inside ( ); hence, if the first alternative matches, group 1 is null.

In general, place the parentheses around all alternatives.

In your case you want to replace

  • "xxx.00000" by "xxx" or else
  • "xxx.yyy00" by "xxx.yyy"

Better do that in two steps, as that is more readable:

  • "xxx.y*00" by "xxx.y*" then
  • "xxx." by "xxx"

This does a bit extra: it also changes an input such as "1." to "1". So:

.replaceFirst("(\\.\\d*?)0+$", "$1").replaceFirst("\\.$", "");
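Wrapped in a method, the two-step version might look like the sketch below (the class name TwoStepTrim is made up for illustration). Group 1 takes part in every match of the first pattern, so "$1" never refers to an unmatched group here:

```java
class TwoStepTrim {
    static String removeTrailingZeros(String s) {
        return s.replaceFirst("(\\.\\d*?)0+$", "$1") // "23.100" -> "23.1", "23.00" -> "23."
                .replaceFirst("\\.$", "");           // "23." -> "23"
    }

    public static void main(String[] args) {
        System.out.println(removeTrailingZeros("23.00")); // 23
        System.out.println(removeTrailingZeros("23.10")); // 23.1
        System.out.println(removeTrailingZeros("1."));    // 1
    }
}
```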
Joop Eggen answered Oct 30 '22 12:10