Executive summary: Are there any caveats or known issues with using \R (or other regex patterns) in Java's Scanner, especially regarding the boundary conditions of its internal buffer?
Details: Since I wanted to do some multi-line pattern matching on potentially multi-platform input files, I used patterns with \R, which according to the Pattern javadoc is:
Any Unicode linebreak sequence, is equivalent to \u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029]
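For reference, here is a minimal sketch of what \R matches on plain strings, with no Scanner and therefore no buffer boundary involved. The class and method names (LinebreakBasics, isSingleLinebreak, splitLines) are made up for this illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class LinebreakBasics {
    // \R needs Java 8 or later; the backslash is doubled for the Java string literal.
    static final Pattern LINEBREAK = Pattern.compile("\\R");

    // True if the whole string is exactly one linebreak sequence.
    static boolean isSingleLinebreak(String s) {
        return LINEBREAK.matcher(s).matches();
    }

    // Splits text on any linebreak sequence, whatever platform produced it.
    static String[] splitLines(String text) {
        return LINEBREAK.split(text);
    }

    public static void main(String[] args) {
        System.out.println(isSingleLinebreak("\r\n")); // CRLF counts as one linebreak
        System.out.println(isSingleLinebreak("\n"));   // so does a lone LF
        System.out.println(isSingleLinebreak("\n\n")); // two linebreaks, not one
        System.out.println(Arrays.toString(splitLines("a\r\nb\nc\rd")));
    }
}
```

When the whole input is available to the matcher, "\r\n" is consumed as a single linebreak, which is the behavior I expected from Scanner as well.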
Anyhow, in one of my test files the loop that's supposed to parse a block of a hex dump was cut short. After some debugging, I found that the line it stopped on fell exactly at the end of Scanner's internal buffer.
Here's a test program I wrote to simulate the situation:
import java.util.Scanner;

public class ScannerLineBreakTest {
    public static void main(String[] args) {
        testString(1);     // short input, well inside Scanner's internal buffer
        testString(1022);  // puts the '\r' at the last position of a 1024-char buffer
    }

    private static void testString(int prefixLen) {
        String suffix = "b\r\nX";
        // prefixLen copies of 'a', followed by the suffix
        String buffer = new String(new char[prefixLen]).replace("\0", "a") + suffix;
        Scanner scanner = new Scanner(buffer);
        String pattern = "b\\R";
        System.out.printf("=================\nTest String (Len=%d): '%s'\n'%s' found with horizon=0 (w/o bound): %s\n",
                buffer.length(), convertLineEndings(buffer), pattern,
                convertLineEndings(scanner.findWithinHorizon(pattern, 0)));
        System.out.printf("'X' found with horizon=1: %b\n", scanner.findWithinHorizon("X", 1) != null);
        scanner.close();
    }

    // Makes line terminators visible in the output.
    private static String convertLineEndings(String string) {
        return string.replaceAll("\\n", "\\\\n").replaceAll("\\r", "\\\\r");
    }
}
... which produces this output (edited for formatting/brevity):
=================
Test String (Len=5): 'ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r\n
'X' found with horizon=1: true
=================
Test String (Len=1026): 'a ... ab\r\nX'
'b\R' found with horizon=0 (w/o bound): b\r
'X' found with horizon=1: false
To me, this looks like a bug! I think the scanner should match that suffix with the patterns the same way regardless of where it shows up in the input text (as long as the prefix doesn't get involved with the patterns). (I have also found the possibly relevant OpenJDK bugs 8176407 and 8072582, but this was with the regular Oracle JDK 8u111.)
But I may have missed some recommendations regarding Scanner or \R pattern usage in particular (or the fact that OpenJDK and Oracle JDK may have identical implementations of the relevant classes here?)... hence the question!
I tested this code at Ideone and it no longer returns "false" on the latest versions of Java.
https://www.ideone.com/4wwYSj
If, however, I were stuck on an old version or one which still exhibits the bug, and I needed a general-purpose solution rather than a workaround for this one example, then I might try crafting a regex similar to \R but which forces a peek at one extra character in the \r case. Note that the so-called "equivalent" pattern in the documentation is not truly equivalent, because it actually needs to be an atomic grouping. So you might end up with something like this:
(?>\u000D\u000A|[\u000A\u000B\u000C\u000D\u0085\u2028\u2029](?=.|\Z))
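A minimal sketch of how that pattern might be plugged into the test from the question. The class and method names (PeekingLinebreakDemo, buildInput, matchesWholeLinebreak) are made up for illustration, and the sketch only covers the question's "b\r\nX" input. One caveat to keep in mind: "." does not match line terminators by default, so for input with consecutive line breaks (blank lines) the lookahead would probably need DOTALL, e.g. (?s:.) in place of the plain dot.

```java
import java.util.Scanner;

class PeekingLinebreakDemo {
    // The pattern from above. Every backslash is doubled so that the regex
    // engine, not the Java compiler, interprets the u-escapes and the Z anchor.
    static final String PEEKING_LINEBREAK =
            "(?>\\u000D\\u000A|[\\u000A\\u000B\\u000C\\u000D\\u0085\\u2028\\u2029](?=.|\\Z))";

    // Same input shape as the question: prefixLen copies of 'a', then "b\r\nX".
    static String buildInput(int prefixLen) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < prefixLen; i++) {
            sb.append('a');
        }
        return sb.append("b\r\nX").toString();
    }

    // True if 'b' plus the linebreak consumed all of "b\r\n", leaving 'X' next.
    static boolean matchesWholeLinebreak(int prefixLen) {
        try (Scanner scanner = new Scanner(buildInput(prefixLen))) {
            String block = scanner.findWithinHorizon("b" + PEEKING_LINEBREAK, 0);
            boolean xIsNext = scanner.findWithinHorizon("X", 1) != null;
            return "b\r\n".equals(block) && xIsNext;
        }
    }

    public static void main(String[] args) {
        System.out.println("prefix 1:    " + matchesWholeLinebreak(1));
        System.out.println("prefix 1022: " + matchesWholeLinebreak(1022));
    }
}
```

The idea is that the lookahead makes the matcher report that it hit the end of the buffered input when the match stops at a lone \r, so Scanner should read more input before accepting the match. If the workaround takes effect (or the JDK already has the fix), both lines should print true.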