
Why is there no special regular expression construct for the backspace character ("\b"), like \\t, \\n, \\r, and \\f, in Java?

Tags:

java

regex

I’m wondering what the reason is for providing special regular-expression constructs for the following characters:

\t - The tab character ('\u0009')

\n - The newline (line feed) character ('\u000A')

\r - The carriage-return character ('\u000D')

\f - The form-feed character ('\u000C')

and, on the other hand, not providing one for the backspace character (\b).

As shown in this question, there is definitely a difference between "\\n" and "\n", or between "\\t" and "\t", when the Pattern.COMMENTS flag is used, but I don't think that explains why there is no regular expression construct for the backspace character.
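To make that difference concrete, here is a minimal sketch. The class name CommentsFlagDemo and the helper matchesWithComments are made up for illustration; only Pattern.compile, Pattern.COMMENTS, and Matcher.matches come from the JDK. With the flag set, a real line feed embedded in the pattern is skipped as whitespace, whereas the two-character construct \n still matches a line feed.

```java
import java.util.regex.Pattern;

class CommentsFlagDemo {
    // Hypothetical helper: true if the whole input matches the regex
    // when it is compiled with Pattern.COMMENTS.
    public static boolean matchesWithComments(String regex, String input) {
        return Pattern.compile(regex, Pattern.COMMENTS).matcher(input).matches();
    }

    public static void main(String[] args) {
        // "a\nb" puts a real line feed into the pattern. COMMENTS mode
        // ignores it, so the effective pattern is just "ab".
        System.out.println(matchesWithComments("a\nb", "ab"));    // true
        System.out.println(matchesWithComments("a\nb", "a\nb"));  // false

        // "a\\nb" puts backslash + n into the pattern, i.e. the regex
        // construct for a line feed, which COMMENTS mode does not ignore.
        System.out.println(matchesWithComments("a\\nb", "a\nb")); // true
    }
}
```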

Isn't there any possible use case for a regular expression construct for the backspace character, not only when the Pattern.COMMENTS flag is set, but maybe in other cases that I don't know about yet? Why is the backspace character treated differently from the other whitespace characters listed above, so that no regular expression construct was provided for it?

Przemysław Moskal asked Jan 19 '26 17:01


1 Answer

Java regex originated from Perl regex, where most shorthand classes had already been defined. Since Perl regex users were accustomed to using "\\b" as a word boundary, there was little reason to change already accepted and well-known shorthands. "\\b" in Perl regex matches a word boundary, and it came to Java regex with this meaning. See this Java regex documentation:

The string literal "\b", for example, matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.

Currently, you can't even make "\\b" act as a backspace inside a character set (as you can in some other languages, e.g. Python). This is done specifically to avoid human errors when writing patterns. According to the latest specs:

It is an error to use a backslash prior to any alphabetic character that does not denote an escaped construct; these are reserved for future extensions to the regular-expression language.
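A small sketch of that rule in practice. The class ReservedEscapeDemo and the helper compiles are hypothetical names, and "\\y" is used on the assumption that \y has no defined meaning in the Java version you run (it has none in the JDK versions I am aware of). The "[\\b]" case is the one from this answer.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class ReservedEscapeDemo {
    // Hypothetical helper: true if Pattern accepts the regex,
    // false if it throws PatternSyntaxException.
    public static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(compiles("\\b"));   // true: word boundary
        System.out.println(compiles("[\\b]")); // false: no meaning inside a character set
        System.out.println(compiles("\\y"));   // false: reserved, undefined escape (assumption above)
        System.out.println(compiles("[\b]"));  // true: a literal backspace character
    }
}
```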

If you have to use a regex escape for a backspace, use a Unicode regex escape "\\u0008":

Java online demo:

import java.util.Arrays;

String s = "word1 and\bword2";
System.out.println(Arrays.toString(s.split("\\b")));  // word boundary
// => [word1,  , and, , word2]  (the fourth element is the invisible backspace character)
System.out.println(Arrays.toString(s.split("\b")));   // literal backspace
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("[\b]"))); // literal backspace in a char set
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("\\u0008"))); // backspace as a Unicode regex escape
// => [word1 and, word2]
System.out.println(Arrays.toString(s.split("[\\b]")));// word boundary is NOT treated as a backspace in a char set
// => java.util.regex.PatternSyntaxException: Illegal/unsupported escape sequence near index 2
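Besides "\\u0008", the Pattern documentation lists other numeric and control-character constructs that should denote the same character: hexadecimal \xhh, octal \0nn, and control \cx. The sketch below checks each of them against a single backspace. The class BackspaceEscapesDemo and the helper matchesBackspace are names invented for this example.

```java
import java.util.regex.Pattern;

class BackspaceEscapesDemo {
    // Hypothetical helper: true if the regex matches exactly one
    // backspace character (U+0008).
    public static boolean matchesBackspace(String regex) {
        return Pattern.matches(regex, "\b");
    }

    public static void main(String[] args) {
        String[] regexes = {
            "\\u0008",   // Unicode escape
            "\\x08",     // hexadecimal escape
            "\\010",     // octal escape (010 octal = 8)
            "\\cH",      // control character Ctrl-H
            "[\\u0008]"  // the Unicode escape also works inside a char set
        };
        for (String regex : regexes) {
            System.out.println(regex + " -> " + matchesBackspace(regex)); // true for each
        }
    }
}
```

Of these, "\\u0008" and "\\x08" are probably the easiest to recognize when reading a pattern later.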
Wiktor Stribiżew answered Jan 21 '26 06:01


