Java scanner usage with \R pattern (issue with buffer boundary)

Executive summary: Are there any caveats/known issues with \R (or other regex pattern) usage in Java's Scanner (especially regarding internal buffer's boundary conditions)?

Details: Since I wanted to do some multi-line pattern matching on potentially multi-platform input files, I used patterns with \R, which according to Pattern javadoc is:

Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
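As a quick illustration of what \R means when applied to a plain String (no Scanner involved), here is a minimal sketch. The class name LineBreakDemo and the helper splitLines are made up for this example; only Pattern.compile and Pattern.split are standard API.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LineBreakDemo {

    // Hypothetical helper: splits text on any linebreak sequence that \R accepts.
    public static String[] splitLines(String text) {
        return Pattern.compile("\\R").split(text);
    }

    public static void main(String[] args) {
        // Windows (CR LF), Unix (LF) and old Mac (CR) line endings in one input.
        String mixed = "one\r\ntwo\nthree\rfour";
        // CR LF is consumed as a single break, so no empty element appears.
        System.out.println(Arrays.toString(splitLines(mixed)));
        // prints [one, two, three, four]
    }
}
```

On an in-memory String the whole input is visible to the matcher at once, so the buffer boundary described below does not come into play.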

Anyhow, I noticed in one of my test files that the loop that is supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line it stopped on ended exactly at the end of Scanner's internal buffer.

Here's a test program I wrote to simulate the situation:

import java.util.Scanner;

public class ScannerLineBreakTest {

    public static void main(String[] args) {
        testString(1);     // "b\r\nX" sits well inside Scanner's buffer
        testString(1022);  // "b\r" fills the 1024-char buffer; "\n" lands just past it
    }

    private static void testString(int prefixLen) {
        String suffix = "b\r\nX";
        // Build a prefix of prefixLen 'a' characters, then append the suffix.
        String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;

        Scanner scanner = new Scanner(buffer);
        String pattern = "b\\R";
        System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
            buffer.length(), convertLineEndings(buffer), pattern,
            convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
        // If the whole "\r\n" was consumed above, 'X' is the very next character.
        System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
        scanner.close();
    }

    // Make CR and LF visible in the printed output.
    private static String convertLineEndings(String string) {
        return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
    }
}

... which produces this output (edited for formatting/brevity):

=================
Test String (Len=5): 'ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r\n
'X' found with horizon=1: true
=================
Test String (Len=1026): 'a ... ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r
'X' found with horizon=1: false

To me, this looks like a bug! I would expect Scanner to match that suffix the same way wherever it appears in the input text (as long as the prefix doesn't interact with the patterns). (I also found the possibly relevant OpenJDK bugs 8176407 and 8072582, but I observed this on the regular Oracle JDK 8u111.)

But I may have missed some recommendation about using Scanner, or the \R pattern in particular (or perhaps OpenJDK and Oracle JDK have identical(??) implementations of the relevant classes?)... hence the question!

asked Mar 02 '18 14:03 by OzgurH


1 Answer

I tested this code on Ideone, and it no longer returns "false" on the latest versions of Java.

https://www.ideone.com/4wwYSj

If, however, I were stuck on an old version or one which still exhibits the bug, and I needed a general-purpose solution rather than a workaround for this one example, then I might try crafting a regex similar to \R but which forces a peek at one more character in the \r case. Note that the so-called "equivalent" pattern in the documentation is not truly equivalent, because it actually needs to be an atomic grouping. So you might end up with something like this:

(?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029](?=.|\Z))
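For reference, here is a rough sketch of how that pattern could be dropped into the question's test in place of \R. The class name LineBreakWorkaround and the helpers findAfterPrefix and escape are invented for this illustration; only Scanner.findWithinHorizon is standard API. I have not run it on every JDK release that shows the original problem, so treat it as a starting point rather than a verified fix.

```java
import java.util.Scanner;

public class LineBreakWorkaround {

    // The pattern from above as a Java string literal. The backslashes are
    // doubled so that the regex engine, not the compiler, reads the escapes.
    public static final String LINE_BREAK =
        "(?>\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029](?=.|\\Z))";

    // Hypothetical helper: builds prefixLen 'a' characters followed by the
    // question's suffix, then searches for 'b' plus a linebreak.
    public static String findAfterPrefix(int prefixLen) {
        StringBuilder input = new StringBuilder();
        for (int i = 0; i < prefixLen; i++) {
            input.append('a');
        }
        input.append("b\r\nX");
        try (Scanner scanner = new Scanner(input.toString())) {
            return scanner.findWithinHorizon("b" + LINE_BREAK, 0);
        }
    }

    // Make CR and LF visible when printing.
    public static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        System.out.println(escape(findAfterPrefix(1)));    // suffix well inside the buffer
        System.out.println(escape(findAfterPrefix(1022))); // CR is the last char of the first buffer fill
    }
}
```

One thing to check before relying on it: by default `.` does not match line terminators, so the lookahead branch will reject a lone linebreak that is immediately followed by another linebreak (a blank line in the middle of the input). Compiling with DOTALL, or writing the lookahead as (?=(?s:.)|\Z), is one way to address that.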

answered Oct 15 '22 07:10 by Patrick Parker