
A partial match changes the Matcher's position

When using Matcher's find() method, a partial match returns false but the matcher's position moves anyway. A subsequent invocation of find() omits those partially matched characters.

Example of a partial match: the pattern "[0-9]+:[0-9]" against the input "a3;9". This pattern doesn't match against any part of the input, so find() returns false, but the subpattern "[0-9]+" matches against "3". If we change the pattern at this point and call find() again, the characters to the left of, and including the partial match, are not tested for a new match.

Note that pattern "[0-9]:[0-9]" (without the quantifier) doesn't produce this effect.

Is this normal behaviour?

Example: in the first for loop, the third pattern [0-9] matches against character "9" and "3" is not reported as a match. In the second loop, pattern [0-9] matches against character "3".

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        final String INPUT = "a3;9";
        String[] patterns = {"a", "[0-9]+:[0-9]", "[0-9]"};

        Matcher matcher = Pattern.compile(".*").matcher(INPUT);

        System.out.printf("Input: %s%n", INPUT);
        matcher.reset();
        for (String s: patterns)
            testPattern(matcher, s);

        System.out.println("=======================================");

        patterns = new String[] {"a", "[0-9]:[0-9]", "[0-9]"};
        matcher.reset();
        for (String s: patterns)
            testPattern(matcher, s);
    }

    static void testPattern(Matcher m, String re) {     
        m.usePattern(Pattern.compile(re));
        System.out.printf("Using regex: %s%n", m.pattern().toString());

        // Testing for pattern
        if(m.find())
            System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
    }
}
Cutter asked Jun 17 '17 10:06


2 Answers

Matcher offers three different kinds of match operations (see the Javadoc):

- matches, which matches the entire input against the pattern
- find, which scans forward for the next match, skipping unmatched input
- lookingAt, which matches from the start of the region but does not need to consume all of it

When a pattern is found by lookingAt, calling matcher.region(matcher.end(), matcher.regionEnd()) (or similar) moves the region start past the match, so the next pattern is tried from that position.
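
A minimal sketch of that approach, using the input and patterns from the question. The class name RegionDemo and the helper matchConsecutively are made up for illustration; only the Matcher and Pattern calls come from the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegionDemo {
    // Hypothetical helper: tries each pattern at the current region start and
    // moves the region forward only when a pattern matches there.
    static List<String> matchConsecutively(String input, String... patterns) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(input);
        for (String re : patterns) {
            m.usePattern(Pattern.compile(re));
            if (m.lookingAt()) {
                found.add(m.group());
                // The arguments are evaluated before region() resets the matcher.
                m.region(m.end(), m.regionEnd());
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(matchConsecutively("a3;9", "a", "[0-9]+:[0-9]", "[0-9]"));
        System.out.println(matchConsecutively("a3;9", "a", "[0-9]:[0-9]", "[0-9]"));
    }
}
```

Both calls print [a, 3]: a failed lookingAt leaves the region where it was, so the quantifier makes no difference. Unlike find(), lookingAt is anchored at the region start, so a pattern that only matches further along the input is not found.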

(Most of the credit goes to the OP.)

Joop Eggen answered Nov 04 '22 09:11


As per Javadoc of Matcher#usePattern:

This method causes this matcher to lose information about the groups of the last match that occurred. The matcher's position in the input is maintained and its last append position is unaffected.

So, according to this documentation, usePattern only guarantees that information about the groups of the last match is lost. No other state in the Matcher class is reset by this method.

This is the actual code inside usePattern method that shows it is only initializing groups:

public Matcher usePattern(Pattern newPattern) {
    if (newPattern == null)
        throw new IllegalArgumentException("Pattern cannot be null");
    parentPattern = newPattern;

    // Reallocate state storage
    int parentGroupCount = Math.max(newPattern.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[newPattern.localCount];
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    for (int i = 0; i < locals.length; i++)
        locals[i] = -1;
    return this;
}
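
To illustrate the documented part of this behaviour (the position is kept across usePattern after a successful match), here is a small sketch. The class name PositionDemo and the helper next are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PositionDemo {
    // Hypothetical helper: swaps in a new pattern and returns the next match,
    // or null when find() fails.
    static String next(Matcher m, String re) {
        m.usePattern(Pattern.compile(re));
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("a").matcher("a3;9");
        System.out.println(m.find() + " end=" + m.end()); // matches "a", ends at 1
        System.out.println(next(m, "[0-9]")); // search resumes at index 1
        System.out.println(next(m, "[0-9]")); // search resumes at index 2
        m.reset();                            // explicit reset goes back to index 0
        System.out.println(next(m, "[0-9]"));
    }
}
```

This prints true end=1, then 3, 9 and 3. Only the explicit reset() moves the matcher back to the start of the input; usePattern on its own never does.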

Note that the Matcher class has private fields first and last that are not exposed through any public method. Using the reflection API, we can see evidence of what is going wrong here.

Check this code block:

import java.lang.reflect.Field;
import java.util.regex.*;

public class UseMatcher {
    final static String INPUT = "a3#9";
    static Matcher m = Pattern.compile("").matcher("");

    public static void main(String[] args) throws Exception {
        executePatterns(new String[] {"a", "[0-9]+:[0-9]", "[0-9]"});
        executePatterns(new String[] {"a", "[0-9]:[0-9]", "[0-9]"});
    }

    static void executePatterns(String[] patterns) throws Exception {
        System.out.printf("================= \"%s\" ======================%n", INPUT);
        m.reset(INPUT);

        boolean found = false;
        for (String re: patterns) {
            m.usePattern(Pattern.compile(re));
            System.out.printf("first/last: %s/%s, Using regex: \"%s\"%n",
                   matcherField("first"), matcherField("last"), m.pattern());

            found = m.find();
            if (found) {
                System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
            }
        }
    }

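    // Reads a private Matcher field via reflection. The field names are JDK
    // implementation details; on JDK 16+ this also needs
    // --add-opens java.base/java.util.regex=ALL-UNNAMED.
    static Object matcherField(String fieldName) throws Exception {
        Field field = m.getClass().getDeclaredField(fieldName);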
        field.setAccessible(true);
        return field.get(m);
    }
}

Output:

================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]+:[0-9]"
first/last: -1/2, Using regex: "[0-9]"
Found 9, end-pos: 4
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]:[0-9]"
first/last: -1/1, Using regex: "[0-9]"
Found 3, end-pos: 2

Compare the first/last positions after applying the patterns "[0-9]+:[0-9]" and "[0-9]:[0-9]". In the first case last becomes 2, whereas in the second case last stays at 1. Hence the next call to find() starts from a different position and we get different matches: 9 vs. 3.


FIX

Since it is evident that the matcher does not reset its last position on each call to usePattern, we can call the overloaded find(int start) method and pass it the end position of the last successful find call.

static void executePatterns(String[] patterns) throws Exception {
    System.out.printf("================= \"%s\" ======================%n", INPUT);
    m.reset(INPUT);

    boolean found = false;
    int nextStart = 0;
    for (String re: patterns) {
        m.usePattern(Pattern.compile(re));
        System.out.printf("first/last: %s/%s, Using regex: \"%s\"%n", matcherField("first"), matcherField("last"), m.pattern());

        found = m.find(nextStart);
        if (found) {
            System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
            nextStart = m.end();
        }
    }
}

Calling it from the same main method as shown above gives the following output:

================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]+:[0-9]"
first/last: -1/2, Using regex: "[0-9]"
Found 3, end-pos: 2
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]:[0-9]"
first/last: -1/0, Using regex: "[0-9]"
Found 3, end-pos: 2

Even though this output still shows similar first/last positions to the previous output, it finds the correct substring 3 both times, for both sets of patterns, because find(int start) is used.
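
The same idea can be sketched without reflection, which keeps it independent of the Matcher internals. The class name FindFromDemo and the helper findInSequence are hypothetical; the sketch assumes the caller wants each pattern searched from the end of the previous successful match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FindFromDemo {
    // Hypothetical helper: searches for each pattern starting at the end of the
    // last successful match, ignoring whatever a failed find() left behind.
    static List<String> findInSequence(String input, String... patterns) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(input);
        int nextStart = 0;
        for (String re : patterns) {
            m.usePattern(Pattern.compile(re));
            // find(int) resets the matcher, then searches from the given index.
            if (m.find(nextStart)) {
                found.add(m.group());
                nextStart = m.end();
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findInSequence("a3#9", "a", "[0-9]+:[0-9]", "[0-9]"));
        System.out.println(findInSequence("a3#9", "a", "[0-9]:[0-9]", "[0-9]"));
    }
}
```

Both calls print [a, 3]. One trade-off to keep in mind: because find(int start) resets the matcher first, any region set with region() and the append position are discarded on every call.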

anubhava answered Nov 04 '22 08:11