I have the following line in a file
00241386002|5296060|0|1|ClaimNote|29DEC2005:10:20:13.557194|JAR007|
I'm trying to match with
line.matches("^\\d+\\|\\d+\\|\\d+\\|\\d+.+$")
That pattern works on the previous ~10k or so lines in the file. It also works on the immediately preceding line which is the same up through the timestamp. It does not, however, work on that line. Even
line.matches(".*")
returns false.
Any help would be appreciated.
edits:
\r and \n will be trimmed.
answer:
A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations: The matches method attempts to match the entire input sequence against the pattern.
The matcher() method of the Pattern class accepts a CharSequence representing the input string and returns a Matcher object which matches the given string against the regular expression represented by the current Pattern object.
Java String matches(): the matches() method checks whether the whole string matches the given regular expression or not.
The matcher() method is used to search for the pattern in a string. It returns a Matcher object which contains information about the search that was performed. The find() method returns true if the pattern was found in the string and false if it was not found.
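The difference between matching the entire input and finding a match somewhere inside it can be sketched as follows. This is only an illustration: the class name, the helper names and the sample input are made up for the example, while Pattern.compile, Matcher.matches, Matcher.find and String.matches are the standard java.util.regex and java.lang.String calls.

```java
import java.util.regex.Pattern;

class MatchesVsFind {
    // True only if the regex matches the WHOLE input (what String.matches does).
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // True if the regex matches anywhere inside the input.
    static boolean partialMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "123|456|rest"; // hypothetical sample input
        System.out.println(wholeMatch("\\d+", input));          // false
        System.out.println(partialMatch("\\d+", input));        // true
        System.out.println(input.matches("\\d+\\|\\d+\\|.+"));  // true
    }
}
```

Because matches() has to consume every character, a single character that the pattern cannot match anywhere in the string is enough to make it return false.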
The \d+\|\d+\|\d+\|\d+ part of your regex seems to be working fine, which suggests that the problem must be related to the .+ part (your .* test fails for the same reason).
Let's test which characters can't be matched by . by default, since any of them would prevent matches from returning true.
(I will test only characters in the range 0-FFFF. Unicode has more characters than that, which Java represents as surrogate pairs, so I am not saying that these are the only characters which . can't match. Even if that is true today, we can't be sure about the future.)
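for (int ch = 0; ch <= 0xFFFF; ch++) {
    // matches() must consume the whole string, so ".*" fails
    // exactly when "." cannot match this single character
    if (!Character.toString((char) ch).matches(".*")) {
        System.out.format("%-4d hex: \\u%04x %n", ch, ch);
    }
}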
We will get as result (with some comments added):
10 hex: \u000a - line feed (\n)
13 hex: \u000d - carriage return (\r)
133 hex: \u0085 - next line (NEL)
8232 hex: \u2028 - line separator
8233 hex: \u2029 - paragraph separator
So I suspect that your string contains one of these characters. Not all tools recognize every one of these characters as a line separator, even though the regex engine does. For instance, let's test BufferedReader
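// needs: import java.io.BufferedReader; import java.io.StringReader;
String data = "AAA\nBBB\rCCC\u0085DDD\u2028EEE\u2029FFF";
BufferedReader br = new BufferedReader(new StringReader(data));
String line = null;
// readLine() can throw IOException, so call this from a method that declares it
while ((line = br.readLine()) != null) {
    System.out.println(line);
}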
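we get as result:
AAA
BBB
CCCDDD
EEE
FFF
Here `\u0085` (NEL) sits invisibly between CCC and DDD. readLine() splits only on \n, \r and \r\n, so the third string it returns is actually CCC\u0085DDD\u2028EEE\u2029FFF; whether \u2028 and \u2029 show up as line breaks depends on the console.
As you see, tools which are not based on the regex engine can return a string which represents a single line but still contains characters which regex sees as line separators.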
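To confirm which of these characters is hiding in the offending line, something like the following sketch could be used. The class name, the method names and the sample string are hypothetical (the real line from the file is not reproduced here); the set of characters is the one found by the test above.

```java
import java.util.ArrayList;
import java.util.List;

class LineTerminatorFinder {
    // The characters that "." rejects in the default regex mode.
    static boolean isLineTerminator(char c) {
        return c == '\n' || c == '\r' || c == '\u0085'
                || c == '\u2028' || c == '\u2029';
    }

    // Returns one entry such as "index 7: U+0085" per terminator found.
    static List<String> findTerminators(String s) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isLineTerminator(c)) {
                hits.add(String.format("index %d: U+%04X", i, (int) c));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical line that ends with NEL after the last pipe.
        String line = "JAR007|\u0085";
        System.out.println(line.matches(".*"));    // false
        System.out.println(findTerminators(line)); // [index 7: U+0085]
    }
}
```

Running findTerminators on the line that fails to match should show whether a stray separator is present and where it sits.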
We can let . match any character. To do so we can use the Pattern.DOTALL flag (we can also enable it by adding (?s) to the regex, like (?s).*).
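A minimal sketch of the DOTALL option, assuming the problematic line ends with a NEL character (that is a guess about the data, and the shortened sample line and the class and method names are invented for the example):

```java
import java.util.regex.Pattern;

class DotAllDemo {
    // Whole-input match with Pattern.DOTALL, so "." also matches line terminators.
    static boolean matchesDotAll(String regex, String input) {
        return Pattern.compile(regex, Pattern.DOTALL).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "^\\d+\\|\\d+\\|\\d+\\|\\d+.+$";
        // Hypothetical, shortened line with a trailing NEL.
        String line = "1|2|3|4|ClaimNote|\u0085";

        System.out.println(line.matches(regex));          // false
        System.out.println(matchesDotAll(regex, line));   // true
        System.out.println(line.matches("(?s)" + regex)); // true
    }
}
```

The embedded (?s) form is convenient with String.matches, which offers no parameter for flags.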
Also, as you already mentioned in your question, we can put the regex engine in Pattern.UNIX_LINES mode (the (?d) flag), which makes it see only \n as a line separator (other characters like \r will not be treated as line separators).
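The effect of UNIX_LINES mode can be sketched like this. The class name, method name and sample strings are illustrative only; Pattern.UNIX_LINES and the (?d) flag are the standard java.util.regex options.

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // Whole-input match with Pattern.UNIX_LINES: only '\n' stays a line terminator.
    static boolean matchesUnixLines(String regex, String input) {
        return Pattern.compile(regex, Pattern.UNIX_LINES).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesUnixLines(".*", "a\rb"));     // true
        System.out.println(matchesUnixLines(".*", "a\u0085b")); // true
        System.out.println(matchesUnixLines(".*", "a\nb"));     // false
        System.out.println("a\rb".matches("(?d).*"));           // true
    }
}
```

Unlike DOTALL, this mode still stops . at \n, so it only helps when line feeds have already been trimmed from the input.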