
Java won't match .*

Tags: java, regex

I have the following line in a file

00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|

I'm trying to match with

line.matches("^\d+\|\d+\|\d+\|\d+.+$")

That pattern works on the previous ~10k or so lines in the file. It also works on the immediately preceding line which is the same up through the timestamp. It does not, however, work on that line. Even

line.matches(".*")

returns false.

Any help would be appreciated.

edits:

  • The lines are read with a BufferedReader, so \r and \n are already trimmed.
  • Already tried a clean and build, no dice.

answer:

  • Thanks to Pshemo for the answer in the first comment. (?d).* (Unix lines mode) also works. There was a '\u0085' at the end of the line that the BufferedReader didn't trim but that Pattern considers a line terminator.
Kevin asked Sep 02 '14 20:09



1 Answer

Problem

The \d+\|\d+\|\d+\|\d+ part of your regex seems to work fine, which suggests that the problem is related to the . part (.+ in your pattern, .* in your second test).

Let's test which characters . cannot match by default, since any of them could prevent matches from returning true.
(I will test only characters in the range 0-FFFF. Unicode has more characters, such as those represented by surrogate pairs, so I am not claiming that these are the only characters . can't match. Even if they are today, we can't be sure about the future.)

// Try every char in 0-FFFF (inclusive) and print those that .* refuses to match.
for (int ch = 0; ch <= '\uFFFF'; ch++) {
    if (!Character.toString((char) ch).matches(".*")) {
        System.out.format("%-4d hex: \\u%04x %n", ch, ch);
    }
}

The result is (with some comments added):

10 hex: \u000a - line feed (\n)
13 hex: \u000d - carriage return (\r)
133 hex: \u0085 - next line (NEL)
8232 hex: \u2028 - line separator
8233 hex: \u2029 - paragraph separator

So I suspect that your string contains one of these characters. Note that not every tool treats all of them as line separators, even though the regex engine does. For instance, let's test BufferedReader:

String data = "AAA\nBBB\rCCC\u0085DDD\u2028EEE\u2029FFF";

BufferedReader br = new BufferedReader(new StringReader(data));
String line = null;
while((line = br.readLine())!=null){
    System.out.println(line);
}

we are getting as result:

AAA
BBB
CCCDDD
    EEE
    FFF
   ⬑ here we have `\u0085` (NEL) 

As you can see, tools that are not based on the regex engine can return a string that represents a single line yet still contains characters the regex engine treats as line separators.
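To confirm which character is the culprit in a particular line, something like the sketch below might help. It reports the index of every char that a default-mode . refuses to match. The class and method names are made up for illustration, and the trailing NEL is appended by hand to mimic the line from the question.

```java
import java.util.ArrayList;
import java.util.List;

class LineTerminatorCheck {

    // Returns the indexes of all chars in the line that a default-mode '.' will not match.
    static List<Integer> unmatchedByDot(String line) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < line.length(); i++) {
            if (!String.valueOf(line.charAt(i)).matches(".")) {
                positions.add(i);
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // The line from the question with a NEL (U+0085) appended, as readLine() left it.
        String line = "00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|\u0085";
        for (int i : unmatchedByDot(line)) {
            // Print the position and the code point in hex so invisible chars become visible.
            System.out.printf("index %d: \\u%04x%n", i, (int) line.charAt(i));
        }
    }
}
```

Printing the code point in hex is the useful part here, since these characters are usually invisible in a console or editor.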

Possible solutions

We can let . match any character, including line terminators. To do so, use the Pattern.DOTALL flag (it can also be enabled inline by adding (?s) to the regex, as in (?s).*).
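A minimal sketch of the DOTALL approach, applied to the pattern from the question. The class and helper names are hypothetical, and the test line is the one from the question with a NEL appended by hand. Note that in a Java string literal the backslashes have to be doubled.

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // The question's pattern written as a Java string literal (backslashes doubled).
    static final String REGEX = "^\\d+\\|\\d+\\|\\d+\\|\\d+.+$";

    // Default mode: '.' stops at any line terminator, including NEL.
    static boolean matchesDefault(String line) {
        return Pattern.compile(REGEX).matcher(line).matches();
    }

    // DOTALL mode: '.' matches every character, line terminators included.
    static boolean matchesDotAll(String line) {
        return Pattern.compile(REGEX, Pattern.DOTALL).matcher(line).matches();
    }

    public static void main(String[] args) {
        String line = "00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|\u0085";
        System.out.println(matchesDefault(line));       // false
        System.out.println(matchesDotAll(line));        // true
        System.out.println(line.matches("(?s).*"));     // true, inline form of the same flag
    }
}
```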

Also, as you already mention in your question, we can put the regex engine in Pattern.UNIX_LINES mode (the (?d) flag), which makes it treat only \n as a line separator (other characters, such as \r, are no longer treated as line separators).
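To see the effect of UNIX_LINES on each of the five characters found earlier, a quick comparison along these lines could be run. This is only a sketch, and the class and method names are invented for the example.

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // True if a one-character string made of ch is fully matched by ".*" under the given flags.
    static boolean dotMatches(char ch, int flags) {
        return Pattern.compile(".*", flags).matcher(String.valueOf(ch)).matches();
    }

    public static void main(String[] args) {
        // LF, CR, NEL, line separator, paragraph separator.
        char[] terminators = {'\n', '\r', (char) 0x85, (char) 0x2028, (char) 0x2029};
        for (char ch : terminators) {
            System.out.printf("U+%04X default=%b UNIX_LINES=%b%n",
                    (int) ch, dotMatches(ch, 0), dotMatches(ch, Pattern.UNIX_LINES));
        }
    }
}
```

Under UNIX_LINES only the line feed should still be rejected, so a stray \r or NEL left in a line no longer breaks the match.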

Pshemo answered Sep 30 '22 17:09