I'm wondering why special regular-expression constructs are provided for the following characters:
\t - The tab character ('\u0009')
\n - The newline (line feed) character ('\u000A')
\r - The carriage-return character ('\u000D')
\f - The form-feed character ('\u000C')
while, on the other hand, none is provided for the backspace character (\b).
As shown in this question, there is definitely a difference between "\\n" and "\n", or between "\\t" and "\t", when the Pattern.COMMENTS flag is used, but I don't think that answers why there is no regular-expression construct for the backspace character.
Is there really no possible use case for a regular-expression construct for the backspace character, not only when the Pattern.COMMENTS flag is active but perhaps in other cases I'm not aware of? Why is the backspace character treated differently from the other characters listed above, leading to the decision not to provide a regular-expression construct for it?
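For reference, the Pattern.COMMENTS difference mentioned above can be reproduced with a small sketch. The class and method names here (CommentsFlagDemo, matchesTab) are made up for illustration. Under COMMENTS, a literal tab inside the pattern is skipped as whitespace, while the two-character regex escape \t still stands for a tab:

```java
import java.util.regex.Pattern;

public class CommentsFlagDemo {
    // Hypothetical helper: does the regex, compiled with COMMENTS, match a single tab?
    static boolean matchesTab(String regex) {
        return Pattern.compile(regex, Pattern.COMMENTS).matcher("\t").matches();
    }

    public static void main(String[] args) {
        // "\t" is a literal tab in the pattern; COMMENTS ignores it, leaving an empty pattern
        System.out.println(matchesTab("\t"));  // false
        // "\\t" is backslash + 't', a regex escape that COMMENTS does not strip
        System.out.println(matchesTab("\\t")); // true
    }
}
```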
Java regex originated from Perl regex, where most shorthand classes had already been defined. Perl regex users were already accustomed to "\\b" as a word boundary, so there was little reason to change an accepted and well-known shorthand. "\\b" in Perl regex matches a word boundary, and it came to Java regex with this meaning. See the Java regex documentation:
The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
Currently, you can't even make "\\b" act as a backspace inside a character class (as you can in some other languages, e.g. Python); this is done specifically to avoid human errors when writing patterns. According to the latest specs:
It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language.
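A minimal sketch of that rule, assuming a standard java.util.regex implementation. The class and method names (ReservedEscapeDemo, compiles) are hypothetical, and "\\y" is used only as an example of an alphabetic escape that is assumed to be undefined in the Java version being run:

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ReservedEscapeDemo {
    // Hypothetical helper: true if the regex compiles, false on a syntax error
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\t"));   // defined escape (tab)
        System.out.println(compiles("\\b"));   // defined escape (word boundary)
        System.out.println(compiles("\\y"));   // assumed undefined, so rejected
        System.out.println(compiles("[\\b]")); // word boundary is not allowed in a character class
    }
}
```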
If you have to use a regex escape for a backspace, use the Unicode regex escape "\\u0008".
Java demo:
import java.util.Arrays;

public class Demo {
    public static void main(String[] args) {
        String s = "word1 and\bword2";
        System.out.println(Arrays.toString(s.split("\\b"))); // WB (word boundary)
        // => [word1,  , and, <BS>, word2]  (<BS> stands for the raw backspace character)
        System.out.println(Arrays.toString(s.split("\b"))); // BS (literal backspace)
        // => [word1 and, word2]
        System.out.println(Arrays.toString(s.split("[\b]"))); // BS in a char set
        // => [word1 and, word2]
        System.out.println(Arrays.toString(s.split("\\u0008"))); // BS as a Unicode regex escape
        // => [word1 and, word2]
        System.out.println(Arrays.toString(s.split("[\\b]"))); // WB NOT treated as BS in a char set
        // => java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 2
    }
}
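Besides "\\u0008", the other generic character escapes documented for java.util.regex.Pattern (hexadecimal, octal, and control-character forms) should also denote a backspace, since they all resolve to code point 8. The sketch below checks this; the class and method names (BackspaceEscapes, matchesBackspace) are hypothetical and exist only for the illustration:

```java
import java.util.regex.Pattern;

public class BackspaceEscapes {
    // Hypothetical helper: does the regex match exactly one backspace character?
    static boolean matchesBackspace(String regex) {
        return Pattern.compile(regex).matcher("\b").matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesBackspace("\\u0008")); // Unicode escape
        System.out.println(matchesBackspace("\\x08"));   // hexadecimal escape
        System.out.println(matchesBackspace("\\010"));   // octal escape (octal 10 = 8)
        System.out.println(matchesBackspace("\\cH"));    // control character Ctrl-H
        System.out.println(matchesBackspace("\\b"));     // word boundary, consumes nothing
    }
}
```

The first four calls print true and the last prints false, because a word boundary is zero-width and cannot match the backspace character itself.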