Why does \R behave differently in regular expressions between Java 8 and Java 9?

Tags: java, regex, unicode, java-8, java-9

The following code compiles in both Java 8 & 9, but behaves differently.

```java
class Simple {
    static String sample = "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    public static void main(String args[]){
        String[] chunks = sample.split("\\R\\R");
        for (String chunk: chunks) {
            System.out.println("Chunk : "+chunk);
        }
    }
}
```

When I run it with Java 8 it returns:

```
Chunk : 
En un lugar
de la Mancha
de cuyo nombre
no quiero acordarme
```

But when I run it with Java 9 the output is different:

```
Chunk : 
En un lugar
Chunk : de la Mancha
de cuyo nombre
Chunk : no quiero acordarme
```

Why?

asked Dec 18 '17 15:12

Germán Bouzas

2 Answers

It was a bug in Java 8 and it got fixed: JDK-8176029 : "Linebreak matcher is not equivalent to the pattern as stated in javadoc".

Also see: Java-8 regex negative lookbehind with `\R`
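As a minimal sketch of what "equivalent to the pattern as stated in javadoc" implies, the alternation from the Javadoc can be written out by hand instead of using `\R`. Applied twice in a row, it can backtrack from CR+LF to a lone CR and leave the LF for the second copy, so it splits the question's sample at every `\r\n`. The class, constant, and method names below are made up for illustration.

```java
import java.util.regex.Pattern;

class JavadocLinebreakDemo {
    // The alternation the Javadoc gives for the linebreak matcher, spelled out
    // by hand and wrapped in a non-capturing group so it can be repeated.
    static final String JAVADOC_R =
        "(?:\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029])";

    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Splits on two consecutive matches of the hand-written alternation.
    static String[] splitOnDoubleBreak(String text) {
        return Pattern.compile(JAVADOC_R + JAVADOC_R).split(text);
    }

    public static void main(String[] args) {
        for (String chunk : splitOnDoubleBreak(SAMPLE)) {
            System.out.println("Chunk : " + chunk);
        }
    }
}
```

Because the pattern is explicit, this should print the same three chunks as the Java 9 run in the question regardless of which of the two Java versions executes it.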

answered Oct 05 '22 23:10

user158037

The Java documentation is out of conformance with the Unicode Standard. The Javadoc misstates what `\R` is supposed to match. It reads:

> `\R` Any Unicode linebreak sequence, is equivalent to `\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]`

That Java documentation is buggy. In its section on R1.6 Line Breaks, Unicode Technical Standard #18 on Regular Expressions clearly states:

> It is strongly recommended that there be a regular expression meta-character, such as "\R", for matching all line ending characters and sequences listed above (for example, in #1). This would correspond to something equivalent to the following expression. That expression is slightly complicated by the need to avoid backup.
>
> `(?:\u{D A}|(?!\u{D A})[\u{A}-\u{D}\u{85}\u{2028}\u{2029}])`

In other words, it can only match a two code-point CR+LF (carriage return + linefeed) sequence or else a single code-point from that set provided that it is not just a carriage return alone that is then followed by a linefeed. That’s because it is not allowed to back up. CRLF must be atomic for \R to function properly.
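To make the "no backing up" point concrete, here is a rough transcription of the UTS #18 expression quoted above into Java regex syntax. It is a sketch, not the JDK's implementation: the class, constant, and method names are hypothetical, and `\u{D A}` is rendered as `\r\n` because Java has no multi-code-point escape of that form. The negative lookahead stops the character class from consuming the CR of a CR+LF pair, so two copies in a row cannot share one `\r\n`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AtomicLinebreakDemo {
    // CR+LF taken as a unit, or else one line-break character that is not
    // the CR at the start of a CR+LF pair (enforced by the negative lookahead).
    static final String UTS18_R =
        "(?:\\r\\n|(?!\\r\\n)[\\n-\\r\\x85\\u2028\\u2029])";

    static final String SAMPLE =
        "\nEn un lugar\r\nde la Mancha\nde cuyo nombre\r\nno quiero acordarme";

    // Splits on two consecutive line breaks as defined by UTS18_R.
    static String[] splitOnDoubleBreak(String text) {
        return Pattern.compile(UTS18_R + UTS18_R).split(text);
    }

    // Counts how many separate line breaks UTS18_R finds in the text.
    static int countBreaks(String text) {
        Matcher m = Pattern.compile(UTS18_R).matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(splitOnDoubleBreak(SAMPLE).length); // 1: no split happens
        System.out.println(countBreaks("\r\n"));               // 1: CR+LF is one break
        System.out.println(countBreaks("\n\r"));               // 2: LF then CR are two breaks
    }
}
```

With this pattern the sample stays in one piece, which matches the Java 8 output shown in the question.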

So Java 9 no longer conforms to what R1.6 strongly recommends. Moreover, it is now doing something that it was supposed to NOT do, and did not do, in Java 8.

Looks like it's time for me to give Sherman (read: Xueming Shen) a holler again. I've worked with him before on these nitty-gritty matters of formal conformance.

answered Oct 06 '22 01:10

tchrist
