When does '.' not match in a Regex?

Tags: java, regex

I encountered the following problem (simplified). I wrote the following:

Pattern pattern = Pattern.compile("Fig.*");
String s = readMyString();
Matcher matcher = pattern.matcher(s);

When reading one string, the matcher failed to match even though the string started with "Fig". I tracked the problem down to a rogue character in the next part of the string. It had code point value 1633 according to

(int) s.charAt(i)

but did not match the regex. I think it is due to a non-UTF-8 encoding somewhere in the input process.

The Javadocs say:

Predefined character classes
.    Any character (may or may not match line terminators)

Presumably this is not a character in the strict sense of the word, but it is still part of the String. How do I detect this problem?

UPDATE: It was due to a (char)10 which was not easy to spot. My diagnosis above is wrong and all answers below are relevant to the question as asked and are useful.
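
A character like (char)10 is invisible when the string is printed, so one way to spot it is to print every code point of the string in U+XXXX form. The following is only a minimal sketch, assuming Java 8 or later for String.codePoints(); the class name CodePointDump, the helper describe and the sample input are hypothetical and not part of the original code.

```java
public class CodePointDump {

    // Hypothetical helper: renders each code point of s as U+XXXX, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up input with a line feed hidden after "Fig"
        System.out.println(describe("Fig\n1"));
        // prints: U+0046 U+0069 U+0067 U+000A U+0031
    }
}
```

In the dump, the line feed shows up as U+000A, which is much easier to notice than in the raw string.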

asked Apr 22 '13 14:04 by peter.murray.rust


2 Answers

It's easy enough to check this:

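import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        // Compile with the default flags (no DOTALL)
        Pattern pattern = Pattern.compile(".");
        // Try every char value below U+FFFF as a one-character string
        for (char c = 0; c < 0xffff; c++) {
            String text = String.valueOf(c);
            if (!pattern.matcher(text).matches()) {
                // Print the numeric value of each char that "." rejects
                System.out.println((int) c);
            }
        }
    }
}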

On my box, the output is:

10
13
133
8232
8233

Of these, 10 and 13 are "\n" and "\r" respectively. 133 (U+0085) is "next line", 8232 (U+2028) is "line separator" and 8233 (U+2029) is "paragraph separator".

Note that:

  • This doesn't test any Unicode characters outside the Basic Multilingual Plane
  • It only uses the default options
  • It seems to contradict your experience of character 1633 (U+0661)
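
The first two limitations can be lifted by iterating over code points instead of char values and by passing the Pattern in as a parameter. This is only a sketch of that idea; the class name DotCheck and the method unmatched are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DotCheck {

    // Hypothetical helper: returns every code point whose one-code-point
    // string is NOT matched in full by the given pattern.
    static List<Integer> unmatched(Pattern pattern) {
        List<Integer> result = new ArrayList<>();
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Character.toChars yields one char for BMP code points
            // and a surrogate pair for supplementary ones
            String text = new String(Character.toChars(cp));
            if (!pattern.matcher(text).matches()) {
                result.add(cp);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Default flags
        System.out.println(unmatched(Pattern.compile(".")));
        // Same pattern with DOTALL enabled
        System.out.println(unmatched(Pattern.compile(".", Pattern.DOTALL)));
    }
}
```

With the default flags this should report the same five code points as the output above, and with Pattern.DOTALL an empty list.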

answered Oct 14 '22 16:10 by Jon Skeet


The . character in a Java regex matches any character except line terminators, unless you use the flag Pattern.DOTALL when compiling your pattern.

To do so, you would use a Pattern like this:

Pattern p = Pattern.compile("somepattern", Pattern.DOTALL);
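
Applied to the pattern from the question, the difference can be sketched as follows. The class name DotAllDemo, the two helper methods and the sample string are assumptions made for illustration; only "Fig.*" and Pattern.DOTALL come from the discussion above.

```java
import java.util.regex.Pattern;

public class DotAllDemo {

    // Default flags: "." stops at line terminators
    static boolean matchesDefault(String s) {
        return Pattern.compile("Fig.*").matcher(s).matches();
    }

    // DOTALL: "." also matches line terminators
    static boolean matchesDotAll(String s) {
        return Pattern.compile("Fig.*", Pattern.DOTALL).matcher(s).matches();
    }

    public static void main(String[] args) {
        // Hypothetical input containing a (char)10 after the first line
        String s = "Figure 1\nCaption text";
        System.out.println(matchesDefault(s)); // prints false
        System.out.println(matchesDotAll(s));  // prints true
    }
}
```

The embedded flag expression (?s) at the start of the pattern has the same effect as passing Pattern.DOTALL.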

answered Oct 14 '22 15:10 by pcalcao