
Why does \R behave differently in regular expressions between Java 8 and Java 9?

The following code compiles in both Java 8 & 9, but behaves differently.

class Simple {
    static String sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    public static void main(String args[]){
        String[] chunks = sample.split("\\R\\R");
        for (String chunk: chunks) {
            System.out.println("Chunk : "+chunk);
        }
    }
}

When I run it with Java 8 it returns:

Chunk : 
En un lugar
de la Mancha
de cuyo nombre
no quiero acordarme

But when I run it with Java 9 the output is different:

Chunk : 
En un lugar
Chunk : de la Mancha
de cuyo nombre
Chunk : no quiero acordarme

Why?

Germán Bouzas asked Dec 18 '17 15:12




2 Answers

It was a bug in Java 8 and it got fixed: JDK-8176029 : "Linebreak matcher is not equivalent to the pattern as stated in javadoc".

Also see: Java-8 regex negative lookbehind with `\R`
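
The difference can be reproduced on any Java version without relying on \R itself. This is only a sketch, and it rests on an assumption: that Java 8 effectively treated CRLF as an indivisible unit, while Java 9 follows the plain alternation given in the Javadoc. The class name and the two pattern constants are made up for illustration; they are not the JDK's internal implementation.

```java
import java.util.regex.Pattern;

class LinebreakSplitDemo {
    // The single-character line terminators listed in the Javadoc.
    static final String SINGLE = "[\\n\\x0B\\f\\r\\x85\\u2028\\u2029]";

    // Plain alternation: after matching CRLF the engine may back up
    // and match the CR on its own, leaving the LF for the next token.
    static final String BACKTRACKING = "(?:\\r\\n|" + SINGLE + ")";

    // Atomic group: once CRLF has matched, the LF cannot be given back.
    static final String ATOMIC = "(?>\\r\\n|" + SINGLE + ")";

    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Splits SAMPLE on two consecutive linebreaks, as the question does.
    static String[] splitOnDouble(String linebreak) {
        return Pattern.compile(linebreak + linebreak).split(SAMPLE);
    }

    public static void main(String[] args) {
        // Prints 3: each CRLF is consumed as CR followed by LF.
        System.out.println("backtracking: " + splitOnDouble(BACKTRACKING).length);
        // Prints 1: no position holds two separate linebreaks.
        System.out.println("atomic:       " + splitOnDouble(ATOMIC).length);
    }
}
```

The backtracking variant yields the three chunks reported for Java 9, and the atomic variant leaves the string unsplit, as reported for Java 8.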

user158037 answered Oct 05 '22 23:10



The Java documentation is out of conformance with the Unicode Standard. The Javadoc misstates what \R is supposed to match. It reads:

\R Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]

That Java documentation is buggy. In its section on R1.6 Line Breaks, Unicode Technical Standard #18 on Regular Expressions clearly states:

It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.

 (?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])

In other words, it can only match a two code-point CR+LF (carriage return + linefeed) sequence or else a single code-point from that set provided that it is not just a carriage return alone that is then followed by a linefeed. That’s because it is not allowed to back up. CRLF must be atomic for \R to function properly.
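
To make the effect of that lookahead concrete, here is a small sketch that transcribes both expressions into Java regex syntax and appends a literal linefeed to each. The class and method names are hypothetical, and the transcription (using \x escapes instead of the \u{...} notation of UTS #18) is mine.

```java
import java.util.regex.Pattern;

class Uts18LinebreakDemo {
    // The alternation quoted from the Javadoc: nothing stops CR from matching alone.
    static final String JAVADOC =
        "(?:\\r\\n|[\\x0A-\\x0D\\x85\\u2028\\u2029])";

    // The UTS #18 form: the negative lookahead rejects a lone CR
    // whenever that CR is immediately followed by LF.
    static final String UTS18 =
        "(?:\\r\\n|(?!\\r\\n)[\\x0A-\\x0D\\x85\\u2028\\u2029])";

    // True if the whole input is one linebreak followed by a literal LF.
    static boolean linebreakThenLf(String linebreak, String input) {
        return Pattern.matches(linebreak + "\\n", input);
    }

    public static void main(String[] args) {
        // true: the engine backs up and lets the linebreak match CR only.
        System.out.println(linebreakThenLf(JAVADOC, "\r\n"));
        // false: CRLF can only be consumed whole, so nothing is left for the LF.
        System.out.println(linebreakThenLf(UTS18, "\r\n"));
        // true for both: CRLF as one linebreak, then a separate LF.
        System.out.println(linebreakThenLf(JAVADOC, "\r\n\n"));
        System.out.println(linebreakThenLf(UTS18, "\r\n\n"));
    }
}
```

Only the first call distinguishes the two: with the Javadoc form a CRLF can be split between the linebreak and whatever follows it, which is exactly the backing up that the UTS #18 expression is written to prevent.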

So Java 9 no longer conforms to what R1.6 strongly recommends. Moreover, it is now doing something that it was supposed to NOT do, and did not do, in Java 8.

Looks like it's time for me to give Sherman (read: Xueming Shen) a holler again. I've worked with him before on these nitty-gritty matters of formal conformance.

tchrist answered Oct 06 '22 01:10
