I encountered the following problem (simplified). I wrote the following:
Pattern pattern = Pattern.compile("Fig.*");
String s = readMyString();
Matcher matcher = pattern.matcher(s);
While reading one string, the matcher failed to match even though the string started with "Fig". I tracked the problem down to a rogue character later in the string. Evaluating (int) charAt(i) on it gave the value 1633, yet the regex did not match. I think it is due to a non-UTF-8 encoding somewhere in the input process.
The Javadocs say:
Predefined character classes
.    Any character (may or may not match line terminators)
Presumably this is not a character in the strict sense of the word, but it is still part of the String. How do I detect this problem?
UPDATE: It was due to a (char)10 which was not easy to spot. My diagnosis above is wrong and all answers below are relevant to the question as asked and are useful.
It's easy enough to check this:
import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        Pattern pattern = Pattern.compile(".");
        for (char c = 0; c < 0xffff; c++) {
            String text = String.valueOf(c);
            if (!pattern.matcher(text).matches()) {
                System.out.println((int) c);
            }
        }
    }
}
On my box, the output is:
10
13
133
8232
8233
Of these, 10 and 13 are "\n" and "\r" respectively. 133 (U+0085) is "next line", 8232 (U+2028) is "line separator" and 8233 (U+2029) is "paragraph separator".
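To answer the "how do I detect this" part for a specific string, the same check can be applied to each code point of the input rather than to every possible char. The following is only a sketch: the class name DotMismatchFinder and the method findUnmatched are made up for illustration, and the sample string is invented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DotMismatchFinder {
    private static final Pattern DOT = Pattern.compile(".");

    /** Returns "index:U+XXXX" for every code point in s that "." does not match. */
    static List<String> findUnmatched(String s) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            String one = new String(Character.toChars(cp));
            if (!DOT.matcher(one).matches()) {
                hits.add(i + ":" + String.format("U+%04X", cp));
            }
            // Advance by 1 or 2 chars depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Invented sample: a line feed hides right before char 1633 (U+0661).
        System.out.println(findUnmatched("Fig 1\n\u0661 caption"));
    }
}
```

For this sample, only the line feed at index 5 is reported. The character with value 1633 (U+0661) is matched by "." like any other ordinary character, which fits the UPDATE in the question.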
Note that the . character in a Java regex matches any character except line terminators, unless you use the flag Pattern.DOTALL when compiling your pattern.
To do so, you would use a Pattern like this:
Pattern p = Pattern.compile("somepattern", Pattern.DOTALL);
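Applied to the pattern from the question, the difference can be sketched as follows. This assumes the original code called matches() on the whole string, which the question does not state; the class name DotAllDemo and the sample input are invented for illustration. The embedded flag (?s) is the in-pattern equivalent of Pattern.DOTALL.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the sample input is made up.
class DotAllDemo {
    static boolean matchesDefault(String s) {
        // "." stops at line terminators, so ".*" cannot cross a '\n'.
        return Pattern.compile("Fig.*").matcher(s).matches();
    }

    static boolean matchesDotAll(String s) {
        // With DOTALL, "." also matches line terminators.
        return Pattern.compile("Fig.*", Pattern.DOTALL).matcher(s).matches();
    }

    static boolean matchesInlineFlag(String s) {
        // (?s) turns on DOTALL from inside the pattern itself.
        return Pattern.compile("(?s)Fig.*").matcher(s).matches();
    }

    public static void main(String[] args) {
        String s = "Fig. 1\ncaption";
        System.out.println(matchesDefault(s));    // false
        System.out.println(matchesDotAll(s));     // true
        System.out.println(matchesInlineFlag(s)); // true
    }
}
```

If only a prefix check is needed, matcher.lookingAt() or String.startsWith("Fig") would avoid the issue without any flag, since neither requires the rest of the string to match.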