On HackerRank, this line of code appears very frequently. I think it is there to skip whitespace, but what does the "\r\u2028\u2029\u0085"
part mean?
scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
Scanner.skip skips input that matches the given pattern. Here the pattern is:
\n newline
\u2028 matches the character with index 2028 base 16 (8232 base 10 or 20050 base 8), case sensitive
1st Alternative \r\n
2nd Alternative [\n\r\u2028\u2029\u0085]
Skipping \r\n is for Windows.
The rest is standard: \r = CR, \n = LF
(see "\r\n, \r, \n: what is the difference between them?")
Then some Unicode special characters:
u2028 = LINE SEPARATOR
(https://www.fileformat.info/info/unicode/char/2028/index.htm)
u2029 = PARAGRAPH SEPARATOR
(http://www.fileformat.info/info/unicode/char/2029/index.htm)
u0085 = NEXT LINE
(https://www.fileformat.info/info/unicode/char/0085/index.htm)
OpenJDK's source code shows that nextLine() uses this regex for line separators:
private static final String LINE_SEPARATOR_PATTERN = "\r\n|[\n\r\u2028\u2029\u0085]";
\r\n is a Windows line ending.
\n is a UNIX line ending.
\r is a Macintosh (pre-OSX) line ending.
\u2028 is LINE SEPARATOR.
\u2029 is PARAGRAPH SEPARATOR.
\u0085 is NEXT LINE (NEL).
The whole thing is a regular expression, so you could simply drop it into https://regexr.com or https://regex101.com/ and it will provide you with a full description of what each part of the regex means.
Here it is for you though:
(\r\n|[\n\r\u2028\u2029\u0085])? / gm
1st Capturing Group (\r\n|[\n\r\u2028\u2029\u0085])?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
1st Alternative \r\n
\r matches a carriage return (ASCII 13)
\n matches a line-feed (newline) character (ASCII 10)
2nd Alternative [\n\r\u2028\u2029\u0085]
Match a single character present in the list below
[\n\r\u2028\u2029\u0085]
\n matches a line-feed (newline) character (ASCII 10)
\r matches a carriage return (ASCII 13)
\u2028 matches the character with index 2028₁₆ (8232₁₀ or 20050₈) literally (case sensitive)
\u2029 matches the character with index 2029₁₆ (8233₁₀ or 20051₈) literally (case sensitive)
\u0085 matches the character with index 85₁₆ (133₁₀ or 205₈) literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)
m modifier: multi line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
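The breakdown above can also be checked directly in Java. The following is only a minimal sketch: the class name LineTerminatorCheck and the helper isOptionalTerminator are made up for illustration, the g and m flags shown by the online tools are not applied, and Matcher.matches() is used to test whole strings, whereas Scanner.skip only anchors the match at the current input position.

```java
import java.util.regex.Pattern;

public class LineTerminatorCheck {
    // The pattern from the question, compiled without any flags.
    static final Pattern OPTIONAL_TERMINATOR =
            Pattern.compile("(\r\n|[\n\r\u2028\u2029\u0085])?");

    // True if the whole string is exactly zero or one line terminator.
    static boolean isOptionalTerminator(String s) {
        return OPTIONAL_TERMINATOR.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"\r\n", "\n", "\r", "\u2028", "\u2029", "\u0085", "", " ", "\n\n"};
        for (String s : samples) {
            // Show each sample as a list of code units so invisible characters are readable.
            StringBuilder codeUnits = new StringBuilder();
            for (char c : s.toCharArray()) {
                codeUnits.append(String.format("U+%04X ", (int) c));
            }
            System.out.println("[" + codeUnits.toString().trim() + "] -> " + isOptionalTerminator(s));
        }
    }
}
```

Each single terminator and the empty string should be accepted, while a space or two consecutive newlines should be rejected, because the group is matched at most once.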
As for what scanner.skip does (Scanner Pattern Tutorial):
The java.util.Scanner.skip(Pattern pattern) method skips input that matches the specified pattern, ignoring delimiters. This method will skip input if an anchored match of the specified pattern succeeds. If a match to the specified pattern is not found at the current position, then no input is skipped and a NoSuchElementException is thrown.
Because the pattern in the question ends with ?, it can also match the empty string, so the skip succeeds even when no line terminator is present and the exception is not thrown.
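To see the difference the trailing ? makes, here is a small sketch that calls skip with and without it. The class SkipOptionalDemo and the helper skipSucceeds are hypothetical names used only for this illustration; the Scanner reads from a String instead of System.in so the behaviour is easy to reproduce.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;

public class SkipOptionalDemo {
    static final String OPTIONAL = "(\r\n|[\n\r\u2028\u2029\u0085])?"; // pattern from the question
    static final String REQUIRED = "\r\n|[\n\r\u2028\u2029\u0085]";    // same pattern without the ?

    // Returns true if skip(pattern) completes, false if it throws NoSuchElementException.
    static boolean skipSucceeds(String input, String pattern) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.skip(pattern);
            return true;
        } catch (NoSuchElementException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // No terminator at the current position: the optional pattern matches nothing and succeeds.
        System.out.println(skipSucceeds("abc", OPTIONAL));
        // Without the ?, the same input makes skip throw.
        System.out.println(skipSucceeds("abc", REQUIRED));
        // With a terminator at the current position, both variants succeed.
        System.out.println(skipSucceeds("\nabc", REQUIRED));
    }
}
```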
I would also recommend reading Alan Moore's answer to "RegEx in Java: how to deal with newline",
where he talks about newer options available as of Java 1.8.
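One Java 8 addition that seems relevant is the linebreak matcher \R in java.util.regex.Pattern; that this is what the linked answer has in mind is an assumption on my part. Note that \R is slightly broader than the pattern in the question, since it also matches vertical tab and form feed. A rough sketch, with a made-up class name and helper:

```java
import java.util.Scanner;

public class LinebreakMatcherDemo {
    // Reads an int, skips at most one linebreak using the Java 8+ linebreak matcher,
    // then returns the next line. Not the original HackerRank code, just a variant.
    static String lineAfterInt(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.nextInt();
            scanner.skip("\\R?"); // the regex engine sees \R? (zero or one linebreak)
            return scanner.nextLine();
        }
    }

    public static void main(String[] args) {
        System.out.println(lineAfterInt("5\r\nhello"));
        System.out.println(lineAfterInt("5\nhello"));
    }
}
```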
scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
\u0085 NEXT LINE (NEL)
\u2029 PARAGRAPH SEPARATOR
\u2028 LINE SEPARATOR
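The whole point of this call is to consume the leftover line terminator (if there is one) that remains in the input after a token read such as nextInt(), so that the following nextLine() does not return an empty string. It does not remove spaces.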
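Here is a minimal sketch of the typical situation, reading from a String instead of System.in so it can be run as is. The class and method names are invented for the example.

```java
import java.util.Scanner;

public class SkipAfterNextInt {
    // Reads an int, consumes the line terminator that follows it, then reads the next line.
    static String readWithSkip(String input) {
        try (Scanner scanner = new Scanner(input)) {
            int n = scanner.nextInt();
            // nextInt() stops right after the digits; the terminator is still unread here.
            scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
            String line = scanner.nextLine();
            return n + ":" + line;
        }
    }

    // Same reads without the skip: nextLine() only returns the empty rest of the first line.
    static String readWithoutSkip(String input) {
        try (Scanner scanner = new Scanner(input)) {
            int n = scanner.nextInt();
            String line = scanner.nextLine();
            return n + ":" + line;
        }
    }

    public static void main(String[] args) {
        System.out.println(readWithSkip("5\nhello world\n"));    // prints 5:hello world
        System.out.println(readWithoutSkip("5\nhello world\n")); // prints 5:
    }
}
```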
There's already a similar question here: scanner.skip. It won't skip spaces, since the space character (\u0020) is not part of the pattern.
\r = CR (Carriage Return) // Used as a new line character in Mac OS before X
\n = LF (Line Feed) // Used as a new line character in Unix/Mac OS X
\r\n = CR + LF // Used as a new line character in Windows
u2028 = line separator
u2029 = paragraph separator
u0085 = next line
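To illustrate that spaces are left alone, here is a short sketch (class and method names are hypothetical). When the number is followed by spaces, the optional pattern matches nothing, so the spaces are still there when nextLine() runs.

```java
import java.util.Scanner;

public class SkipDoesNotTrimSpaces {
    // Returns whatever nextLine() yields after nextInt() and the skip from the question.
    static String restAfterInt(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.nextInt();
            scanner.skip("(\r\n|[\n\r\u2028\u2029\u0085])?");
            return scanner.nextLine();
        }
    }

    public static void main(String[] args) {
        // Three spaces follow the 7, so the skip matches the empty string and
        // nextLine() returns those three spaces rather than "next".
        System.out.println("[" + restAfterInt("7   \nnext") + "]");
        // With the terminator directly after the number, the skip consumes it.
        System.out.println("[" + restAfterInt("7\nnext") + "]");
    }
}
```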