When using Matcher's find() method, a partial match returns false but the matcher's position moves anyway. A subsequent invocation of find() skips those partially matched characters.
Example of a partial match: the pattern "[0-9]+:[0-9]" against the input "a3;9". This pattern doesn't match any part of the input, so find() returns false, but the subpattern "[0-9]+" matches "3". If we change the pattern at this point and call find() again, the partially matched characters, and everything to their left, are not tested for a new match.
Note that the pattern "[0-9]:[0-9]" (without the quantifier) doesn't produce this effect.
Is this normal behaviour?
Example: in the first for loop, the third pattern [0-9] matches the character "9", and "3" is not reported as a match. In the second loop, the pattern [0-9] matches the character "3".
import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        final String INPUT = "a3;9";
        String[] patterns = {"a", "[0-9]+:[0-9]", "[0-9]"};
        Matcher matcher = Pattern.compile(".*").matcher(INPUT);
        System.out.printf("Input: %s%n", INPUT);
        matcher.reset();
        for (String s: patterns)
            testPattern(matcher, s);
        System.out.println("=======================================");
        patterns = new String[] {"a", "[0-9]:[0-9]", "[0-9]"};
        matcher.reset();
        for (String s: patterns)
            testPattern(matcher, s);
    }

    static void testPattern(Matcher m, String re) {
        m.usePattern(Pattern.compile(re));
        System.out.printf("Using regex: %s%n", m.pattern().toString());
        // Testing for pattern
        if (m.find())
            System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
    }
}
Matcher offers three different kinds of match operation (see the Javadoc):
- matches, which requires the entire input to match
- find, which traverses the input, skipping whatever does not match
- lookingAt, which does a partial match anchored at the start of the sequence
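To make the difference between the three operations concrete, here is a minimal sketch. The class name MatchKinds, the helper probe and the sample inputs are mine, not from the question.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class MatchKinds {
    // Hypothetical helper: returns {matches, lookingAt, find} for a regex and an input.
    // A fresh matcher is used for each call so the three results are independent.
    static boolean[] probe(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        boolean matches = p.matcher(input).matches();     // the whole input must match
        boolean lookingAt = p.matcher(input).lookingAt(); // a prefix of the input must match
        boolean find = p.matcher(input).find();           // a match anywhere is enough
        return new boolean[] {matches, lookingAt, find};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(probe("[0-9]+", "a3;9"))); // [false, false, true]
        System.out.println(Arrays.toString(probe("[0-9]+", "39;a"))); // [false, true, true]
        System.out.println(Arrays.toString(probe("[0-9]+", "39")));   // [true, true, true]
    }
}
```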
When a pattern is found by lookingAt, invoking matcher.region(matcher.end(), matcher.regionEnd()) or similar moves the matcher on so that the next consecutive pattern can be tried.
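A rough sketch of that idea, applied to the input and patterns from the question. The class ConsecutiveTokens and the method consume are hypothetical names of mine. Note that lookingAt is anchored at the region start, so unlike find it will not skip over unmatched characters.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsecutiveTokens {
    // Hypothetical helper: tries each regex in order, anchored at the current
    // region start. On success the region is narrowed so that the next regex
    // starts where this match ended; on failure the region is left untouched.
    static List<String> consume(String input, String... regexes) {
        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(input);
        for (String re : regexes) {
            m.usePattern(Pattern.compile(re));
            if (m.lookingAt()) {
                tokens.add(m.group());
                // region() resets the matcher and limits it to [start, end)
                m.region(m.end(), m.regionEnd());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The failed "[0-9]+:[0-9]" attempt does not move the region,
        // so "[0-9]" still sees the "3".
        System.out.println(consume("a3;9", "a", "[0-9]+:[0-9]", "[0-9]")); // [a, 3]
    }
}
```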
(Most of the credit goes to the OP themselves.)
As per the Javadoc of Matcher#usePattern:
This method causes this matcher to lose information about the groups of the last match that occurred. The matcher's position in the input is maintained and its last append position is unaffected.
So, as per this documentation, usePattern only guarantees that information about the groups of the last match is lost. No other state in the Matcher class is reset by this method.
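The "position is maintained" part can be observed without any failed match at all. A minimal sketch, with an input of my own choosing and hypothetical names UsePatternPosition and secondMatchStart:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UsePatternPosition {
    // Hypothetical helper: finds `first`, swaps in `second` on the same matcher
    // with usePattern, and finds again. Returns the start index of the second
    // match, or -1 if either search fails.
    static int secondMatchStart(String input, String first, String second) {
        Matcher m = Pattern.compile(first).matcher(input);
        if (!m.find()) {
            return -1;
        }
        m.usePattern(Pattern.compile(second)); // position in the input is kept
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        // "a" is found at index 1, so the digit search resumes at index 2
        // and never sees the "3" at index 0.
        System.out.println(secondMatchStart("3a3", "a", "[0-9]")); // 2
        // A fresh matcher, by contrast, starts at the beginning of the input.
        Matcher fresh = Pattern.compile("[0-9]").matcher("3a3");
        System.out.println(fresh.find() ? fresh.start() : -1);     // 0
    }
}
```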
This is the actual code inside the usePattern method, which shows that it only reinitializes the groups (and locals) storage:
public Matcher usePattern(Pattern newPattern) {
    if (newPattern == null)
        throw new IllegalArgumentException("Pattern cannot be null");
    parentPattern = newPattern;
    // Reallocate state storage
    int parentGroupCount = Math.max(newPattern.capturingGroupCount, 10);
    groups = new int[parentGroupCount * 2];
    locals = new int[newPattern.localCount];
    for (int i = 0; i < groups.length; i++)
        groups[i] = -1;
    for (int i = 0; i < locals.length; i++)
        locals[i] = -1;
    return this;
}
Note that the Matcher class has private fields first and last that are not exposed through any public method. If we use the reflection API, we can see evidence of what's going wrong here.
Check this code block:
import java.lang.reflect.Field;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UseMatcher {
    final static String INPUT = "a3#9";
    static Matcher m = Pattern.compile("").matcher("");

    public static void main(String[] args) throws Exception {
        executePatterns(new String[] {"a", "[0-9]+:[0-9]", "[0-9]"});
        executePatterns(new String[] {"a", "[0-9]:[0-9]", "[0-9]"});
    }

    static void executePatterns(String[] patterns) throws Exception {
        System.out.printf("================= \"%s\" ======================%n", INPUT);
        m.reset(INPUT);
        boolean found = false;
        for (String re: patterns) {
            m.usePattern(Pattern.compile(re));
            System.out.printf("first/last: %s/%s, Using regex: \"%s\"%n",
                matcherField("first"), matcherField("last"), m.pattern());
            found = m.find();
            if (found) {
                System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
            }
        }
    }

    // Reads a private field of Matcher via reflection. On recent JDKs with strong
    // encapsulation this likely needs --add-opens java.base/java.util.regex=ALL-UNNAMED.
    static Object matcherField(String fieldName) throws Exception {
        Field field = m.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        return field.get(m);
    }
}
Output:
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]+:[0-9]"
first/last: -1/2, Using regex: "[0-9]"
Found 9, end-pos: 4
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]:[0-9]"
first/last: -1/1, Using regex: "[0-9]"
Found 3, end-pos: 2
Check the difference in the first/last positions after applying the patterns "[0-9]+:[0-9]" and "[0-9]:[0-9]". In the first case last becomes 2, whereas in the second case last remains at 1. Hence, when we call find() the next time, we get different matches, i.e. 9 vs 3.
Since it is evident that the matcher does not reset the last position on every call of usePattern, we can call the overloaded find(int start) method and supply the end position from the last successful find call.
static void executePatterns(String[] patterns) throws Exception {
    System.out.printf("================= \"%s\" ======================%n", INPUT);
    m.reset(INPUT);
    boolean found = false;
    int nextStart = 0;
    for (String re: patterns) {
        m.usePattern(Pattern.compile(re));
        System.out.printf("first/last: %s/%s, Using regex: \"%s\"%n",
            matcherField("first"), matcherField("last"), m.pattern());
        found = m.find(nextStart);
        if (found) {
            System.out.printf("Found %s, end-pos: %d%n", m.group(), m.end());
            nextStart = m.end();
        }
    }
}
When we call it from the same main method as shown above, we get the following output:
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]+:[0-9]"
first/last: -1/2, Using regex: "[0-9]"
Found 3, end-pos: 2
================= "a3#9" ======================
first/last: -1/0, Using regex: "a"
Found a, end-pos: 1
first/last: 0/1, Using regex: "[0-9]:[0-9]"
first/last: -1/0, Using regex: "[0-9]"
Found 3, end-pos: 2
Even though this output still shows last being moved to 2 by the failed "[0-9]+:[0-9]" attempt, it finds the correct substring 3 both times with the two different pattern sets. This works because find(int start) resets the matcher and begins the search at the index we supply.
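The same workaround can be packaged without the reflection diagnostics. This is only a sketch; the class SequentialFinder and the method findInOrder are hypothetical names of mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequentialFinder {
    // Hypothetical helper: applies each regex in turn to the same matcher,
    // always searching from the end of the last successful match.
    static List<String> findInOrder(String input, String... regexes) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("").matcher(input);
        int nextStart = 0;
        for (String re : regexes) {
            m.usePattern(Pattern.compile(re));
            // find(int) resets the matcher first, so whatever a failed attempt
            // left behind cannot influence where this search begins.
            if (m.find(nextStart)) {
                found.add(m.group());
                nextStart = m.end();
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findInOrder("a3#9", "a", "[0-9]+:[0-9]", "[0-9]")); // [a, 3]
        System.out.println(findInOrder("a3#9", "a", "[0-9]:[0-9]", "[0-9]"));  // [a, 3]
    }
}
```

Keep in mind that find(int) also resets the region, so this sketch assumes the whole input is being searched.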