Java scanner usage with \R pattern (issue with buffer boundary)

Tags: java, regex, java.util.scanner

Executive summary: Are there any caveats or known issues with using \R (or other regex patterns) in Java's Scanner, especially at the boundary of its internal buffer?

Details: I wanted to do multi-line pattern matching on input files that may come from different platforms, so I used patterns containing \R, which according to the Pattern javadoc is:

Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
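
As a quick illustration of that definition, here is a minimal sketch that splits a string with mixed line endings on \R using plain String.split, with no Scanner involved. The class name, method name and sample input are made up for this example.

```java
import java.util.Arrays;

class LineBreakSplitDemo {
    // Splits the text on any Unicode linebreak sequence (requires Java 8 or later).
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // CRLF, LF, CR and the U+2028 line separator mixed in one string.
        String mixed = "one\r\ntwo\nthree\rfour\u2028five";
        System.out.println(Arrays.toString(splitLines(mixed)));
    }
}
```

Each of the four separators, including the two-character "\r\n", is treated as a single line break here, so the sample splits into five pieces.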

Anyhow, in one of my test files the loop that is supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line it stopped on ended exactly at the end of Scanner's internal buffer.

Here's a test program I wrote to simulate the situation:

import java.util.Scanner;

public class ScannerLineBreakTest {

    public static void main(String[] args) {
        testString(1);     // "b\r\n" sits well inside the buffer
        testString(1022);  // "b\r" ends at char 1024, the buffer boundary described above
    }

    private static void testString(int prefixLen) {
        String suffix = "b\r\nX";
        // Build a prefix of prefixLen 'a' characters, then append the suffix.
        String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;

        Scanner scanner = new Scanner(buffer);
        String pattern = "b\\R";
        System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
            buffer.length(), convertLineEndings(buffer), pattern,
            convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
        // If \R consumed the whole "\r\n", the very next character must be 'X'.
        System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
        scanner.close();
    }

    // Makes CR and LF visible by printing them as the literal text \r and \n.
    private static String convertLineEndings(String string) {
        return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
    }
}

... which produces this output (edited for formatting/brevity):

=================
Test String (Len=5): 'ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r\n
'X' found with horizon=1: true
=================
Test String (Len=1026): 'a ... ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r
'X' found with horizon=1: false
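
To check whether a given JVM is affected, and at which offsets, the test above can be turned into a small sweep over prefix lengths around the boundary. This is only a sketch: the class and method names are hypothetical, and the range 1018 to 1026 is chosen simply to bracket the prefix length 1022 from the failing case.

```java
import java.util.Scanner;

class BoundaryProbe {
    // Returns what Scanner matched for "b\R" when prefixLen filler
    // characters precede the suffix "b\r\nX".
    static String matchAfterPrefix(int prefixLen) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < prefixLen; i++) {
            input.append('a');
        }
        input.append("b\r\nX");
        try (Scanner scanner = new Scanner(input.toString())) {
            return scanner.findWithinHorizon("b\\R", 0);
        }
    }

    public static void main(String[] args) {
        for (int prefixLen = 1018; prefixLen <= 1026; prefixLen++) {
            // A full match consumes both the CR and the LF.
            boolean whole = "b\r\n".equals(matchAfterPrefix(prefixLen));
            System.out.printf("prefixLen=%d: full CRLF matched = %b%n", prefixLen, whole);
        }
    }
}
```

On a JVM that shows the behaviour above, the line for prefixLen=1022 should report false while its neighbours report true.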

To me, this looks like a bug! I think Scanner should match that suffix against these patterns the same way regardless of where it appears in the input text (as long as the prefix does not take part in the match). I have also found the possibly relevant OpenJDK bugs 8176407 and 8072582, although I observed this on the regular Oracle JDK 8u111.

But I may have missed some recommendation about using Scanner or the \R pattern in particular, or perhaps OpenJDK and Oracle JDK have identical(?) implementations of the relevant classes here... hence the question!


asked Mar 02 '18 14:03

OzgurH

1 Answer

I tested this code on Ideone, and on recent versions of Java it no longer returns "false".

https://www.ideone.com/4wwYSj

If, however, I were stuck on an old version, or on one that still exhibits the bug, and I needed a general-purpose solution rather than a workaround for this one example, I might try crafting a regex similar to \R that forces a peek at one extra character in the \r case. Note that the so-called "equivalent" pattern in the documentation is not truly equivalent, because it actually needs to be an atomic group. So you might end up with something like this:

(?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029](?=.|\Z))
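
A minimal sketch of how this could be checked follows. It first shows why atomic grouping matters, using a reduced CR/LF-only pattern that is my own simplification, and then repeats the question's boundary test with the pattern above copied verbatim. The class and method names are hypothetical.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class LineBreakWorkaroundDemo {
    // The pattern from the answer, copied verbatim into a Java string literal.
    static final String LINE_BREAK =
        "(?>\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029](?=.|\\Z))";

    // Non-atomic group: the engine may backtrack, match only the CR,
    // and leave the LF for the trailing \n.
    static boolean nonAtomicMatches() {
        return Pattern.matches("(?:\\r\\n|[\\n\\r])\\n", "\r\n");
    }

    // Atomic group: once CRLF is consumed it is never given back,
    // so the trailing \n has nothing left to match.
    static boolean atomicMatches() {
        return Pattern.matches("(?>\\r\\n|[\\n\\r])\\n", "\r\n");
    }

    // Repeats the question's failing case (prefix length 1022) with the workaround pattern.
    static String matchAtBoundary() {
        String input = new String(new char[1022]).replace("\0", "a") + "b\r\nX";
        try (Scanner scanner = new Scanner(input)) {
            return scanner.findWithinHorizon("b" + LINE_BREAK, 0);
        }
    }

    public static void main(String[] args) {
        System.out.println("non-atomic group matches: " + nonAtomicMatches());
        System.out.println("atomic group matches: " + atomicMatches());
        System.out.println("full CRLF matched at boundary: " + "b\r\n".equals(matchAtBoundary()));
    }
}
```

One thing to keep in mind with this pattern: by default "." does not match line terminators, so a lone line break that is directly followed by another line break (an empty line) fails the lookahead. If that matters for your input, replacing "." with "(?s:.)" in the lookahead is one possible adjustment.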


answered Oct 15 '22 07:10

Patrick Parker
