Regexp Backslash - GNU Emacs Manual says that \< matches at the beginning of a word, \> matches at the end of a word, and \b matches a word boundary. \b is just as in other non-Emacs regular expressions. But it seems that \< and \> are particular to Emacs regular expressions. Are there cases where \< and \> are needed instead of \b? For instance, \bword\b would match the same as \<word\> would, and the only difference is that the latter is more readable.
Simply put: \b allows you to perform a “whole words only” search using a regular expression in the form of \bword\b. A “word character” is a character that can be used to form words. All characters that are not “word characters” are “non-word characters”.
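As a quick illustration of the "whole words only" idea, here is a minimal sketch using Python's `re` module (chosen only for illustration; the sample text and variable names are made up). It compares a plain substring search with the `\bcat\b` form:

```python
import re

# Hypothetical sample text: "cat" appears as a whole word twice,
# and as part of a longer word twice ("concat", "cats").
text = "cat concat cats cat."

# \bcat\b only matches where "cat" has a word boundary on both sides.
whole_words = re.findall(r"\bcat\b", text)

# Without the boundaries, every occurrence of the letters is matched.
substring_hits = re.findall(r"cat", text)

print(whole_words)          # the two standalone occurrences
print(len(substring_hits))  # all four occurrences
```

The trailing "." in the sample is a non-word character, so the last "cat" still counts as a whole word.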
To match a character that has a special meaning in regex, you need to use an escape sequence prefixed with a backslash (\). E.g., \. matches "."; regex \+ matches "+"; and regex \( matches "(". You also need to use regex \\ to match "\" (backslash).
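The escaping rule can be checked with a short sketch in Python's `re` module (an assumption for illustration; the sample string is invented). Raw strings are used so that the backslashes reach the regex engine unchanged:

```python
import re

# Hypothetical sample containing ".", "+", "(" and a literal backslash.
sample = r"1.5+2 (x\y)"

dots = re.findall(r"\.", sample)         # literal dot
pluses = re.findall(r"\+", sample)       # literal plus
parens = re.findall(r"\(", sample)       # literal opening parenthesis
backslashes = re.findall(r"\\", sample)  # literal backslash

# re.escape builds the escaped form of a literal string for you.
escaped = re.escape("a.b")

print(dots, pluses, parens, backslashes, escaped)
```

Using `re.escape` is usually safer than escaping by hand when the literal text comes from user input.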
The basic purpose of the non-word-boundary is to create a regex that says: if we are at the beginning/end of a word char (\w = [a-zA-Z0-9_]), make sure the previous/next character is also a word char, e.g.: "a\B." ~ "a\w": "ab", "a4", "a_", ... but not "a ", "a."
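The "a\B." example can be tried directly. This is a small sketch in Python (an illustrative choice); `re.ASCII` is passed so that \w means [a-zA-Z0-9_] as stated above, since Python's default \w also accepts non-ASCII letters:

```python
import re

# Candidate strings taken from the example above.
candidates = ["ab", "a4", "a_", "a ", "a."]

# "a", then a position that is NOT a word boundary, then any character.
pattern = re.compile(r"a\B.", re.ASCII)

# Keep only the candidates that the pattern matches in full.
matched = [s for s in candidates if pattern.fullmatch(s)]

print(matched)
```

Only the candidates whose second character is a word character survive, which is why "a\B." behaves like "a\w" here.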
(?i) makes the regex case insensitive. (?s), for "single line mode", makes the dot match all characters, including line breaks.
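Both inline modifiers are supported by Python's `re` module, so a minimal sketch (with an invented two-line sample) can show their effect. Note that Python requires such global inline flags to appear at the very start of the pattern:

```python
import re

# Hypothetical two-line sample.
text = "Start\nEND"

# (?i): case-insensitive matching, so "end" finds "END".
case_hit = re.search(r"(?i)end", text)

# Without (?s) the dot refuses to match the line break.
no_dotall = re.search(r"Start.END", text)

# With (?s) the dot matches the line break as well.
dotall = re.search(r"(?s)Start.END", text)

print(case_hit.group(), no_dotall, repr(dotall.group()))
```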
You can get unexpected results if you assume they behave the same.
What can \< and \> do that \b can't?
The answer is that \< and \> are explicit: each one matches this end of a word, and only this end. \b is general: either end of a word will match.
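To make the "explicit versus general" distinction concrete, here is a sketch in Python. Python's `re` module has \b but no \< or \>, so the two directional anchors are emulated with lookarounds; that emulation, and the names `WORD_START`, `WORD_END` and `positions`, are assumptions made for this example rather than anything from Emacs or sed:

```python
import re

# Emulations (assumed equivalents, not native operators in Python):
WORD_START = r"\b(?=\w)"   # boundary with a word char AFTER it, like \<
WORD_END = r"\b(?<=\w)"    # boundary with a word char BEFORE it, like \>

def positions(pattern, text):
    """Return the offsets at which a zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]

text = "cat dog"

both = positions(r"\b", text)         # every word edge
starts = positions(WORD_START, text)  # only beginnings of words
ends = positions(WORD_END, text)      # only ends of words

print(both, starts, ends)
```

In this sketch the \b positions are exactly the union of the start and end positions, which is the sense in which \b is "general" and the other two are "explicit".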
See "GNU Operators", section "Word Operators", in the GNU regex documentation.
line="cat dog sky"
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo
line="cat dog sky"
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo
line="cat dog sky "
echo "$line" |sed -n "s/\(.*\)\b\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\>\(.*\)/# |\1|\2|/p"
echo "$line" |sed -n "s/\(.*\)\<\(.*\)/# |\1|\2|/p"
echo
Output:
# |cat dog |sky|
# |cat dog| sky|
# |cat dog |sky|
# |cat dog |sky|
# |cat dog| sky|
# |cat dog |sky|
# |cat dog sky| |
# |cat dog sky| |
# |cat dog |sky |
It looks to me like \<.*?\> would match only series of word characters, while \b.*?\b would match either a series of word characters or a series of non-word characters, since it can also accept the end of a word, and then the beginning of one. If you force the expression between the two to be a word, they do indeed act the same.
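This difference can be sketched in Python, with two caveats that are assumptions of the example: \< and \> are emulated with lookarounds because Python's `re` lacks them, and `.+?` is used instead of `.*?` so that the engine does not settle for empty matches at the first boundary:

```python
import re

text = "cat dog sky"

# \b on both sides: the span between two boundaries may be a word
# OR the gap between two words.
either = re.findall(r"\b.+?\b", text)

# Emulated \< ... \>: must start at a word start and stop at a word end.
words_only = re.findall(r"\b(?=\w).+?\b(?<=\w)", text)

print(either)      # words and the spaces between them
print(words_only)  # words only
```

The \b version returns the separating spaces as matches too, because the end of one word followed by the start of the next is also a pair of boundaries.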
Of course, you could replicate the behavior of \< and \> with \b\w and \w\b. So I guess the answer is that yes, it's mostly for readability. Then again, isn't that what most escape characters in regular expressions are for?
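One caveat with the \b\w and \w\b replacements is that they consume a character, whereas \< and \> are zero-width. The following Python sketch shows the effect; the lookahead form `\b(?=\w)` is an assumed zero-width stand-in for \<, since Python's `re` has no such operator:

```python
import re

text = "cat dog"

# \b\w and \w\b find the right places, but each match includes a letter.
first_chars = re.findall(r"\b\w", text)
last_chars = re.findall(r"\w\b", text)

# Replacing with the consuming form eats the first letter of each word...
marked_consuming = re.sub(r"\b\w", "[", text)

# ...while a zero-width form only inserts at the word start.
marked_zero_width = re.sub(r"\b(?=\w)", "[", text)

print(first_chars, last_chars)
print(marked_consuming)
print(marked_zero_width)
```

So the two spellings agree on where words begin and end, but they are only interchangeable when the extra consumed character does not matter, for example when the word itself follows in the pattern anyway.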