 

Regular Expressions: '\<' vs '\b'

Tags: regex, linux, bash

Currently prepping for RHCSA and learning regex. What's the difference between \b and \<?

They seem to do almost exactly the same thing: match the string between the two escapes as a separate word.
Example:

[root@RHEL8DEV etc]# grep '\<root\>' * 2>/dev/null  | wc
    105     327    3658
[root@RHEL8DEV etc]# grep '\broot\b' * 2>/dev/null  | wc
    105     327    3658

Even after reading on gnu.org, I'm still scratching my head.


Using \b

  • \b matches the empty string, but only at the beginning or end of a word. Thus, \bfoo\b matches any occurrence of foo as a separate word. \bballs?\b matches ball or balls as a separate word. \b matches at the beginning or end of the buffer regardless of what text appears next to it.

Using \< and \>

  • \< matches the empty string, but only at the beginning of a word. \< matches at the beginning of the buffer only if a word-constituent character follows.
  • \> matches the empty string, but only at the end of a word. \> matches at the end of the buffer only if the contents end with a word-constituent character.

Thanks for taking time to read this.

hixavier asked Oct 15 '25 18:10



2 Answers

Only the manual page for your specific version of grep can reveal whether they are exactly equivalent. Neither is fully portable.

Traditionally, in some versions of egrep, \< would only match at a left word boundary and \> at a right one. (However, Procmail, for example, took a shortcut and actually defines both identically.)

\b is a newer construct from Perl et al., and is direction neutral, i.e. it is true at a word boundary either on the left or on the right of a sequence of word characters.

tripleee answered Oct 17 '25 08:10



I've personally found \b to be more broadly supported than \< and \>. The only exceptions I've encountered are vim and BSD sed, which support \< and \> but not \b.

As to their definitions: written in PCRE syntax, they are essentially

  • \< = (?<!\w)(?=\w) = word character on the right but not on the left
  • \> = (?<=\w)(?!\w) = word character on the left but not on the right
  • \b = (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)) = either of the above

Regex101 can explain each of these regexes. Note that none of that site's four supported engines understands what \< and \> are supposed to do.
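To check the three equivalences above without a GNU tool at hand, here is a minimal sketch using Python's re module, which supports the same lookaround syntax. The names WORD_START, WORD_END and positions are my own, chosen for illustration, and the sample text is arbitrary.

```python
import re

# Lookaround spellings from the list above (names are illustrative only).
WORD_START = r"(?<!\w)(?=\w)"  # plays the role of \<
WORD_END = r"(?<=\w)(?!\w)"    # plays the role of \>


def positions(pattern, text):
    """Return the set of offsets at which a zero-width pattern matches."""
    return {m.start() for m in re.finditer(pattern, text)}


text = "root, chroot; root_dir"
starts = positions(WORD_START, text)
ends = positions(WORD_END, text)
boundaries = positions(r"\b", text)

print(sorted(starts))               # offsets where a word begins
print(sorted(ends))                 # offsets where a word ends
print(boundaries == starts | ends)  # \b is the union of the two
```

Every offset reported for \b shows up in exactly one of the other two sets, which is the "either of the above" relationship in the last bullet.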

Since PCRE explicitly prohibits giving special meanings to non-alphanumeric escapes, \< means "literal open angle-bracket" and therefore (?:\<|\>) means [<>] rather than \b. Standard Extended Regular Expressions do not have this explicit prohibition, though they also do not implement any such special meanings (items like \< and \> are non-standard extensions).
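Python's re module treats these escapes the same way, so it makes a convenient stand-in for seeing the literal-bracket behaviour. This is only a sketch; the sample strings are made up.

```python
import re

# As in PCRE, a backslash before a non-alphanumeric ASCII character
# just yields that character, so \< and \> are plain angle brackets here.
escaped = re.compile(r"(?:\<|\>)")
bracket_class = re.compile(r"[<>]")

sample = "a <b> c"
literal_hits = escaped.findall(sample)
class_hits = bracket_class.findall(sample)
print(literal_hits == class_hits)  # both patterns find the same brackets

# No word-boundary behaviour: the pattern needs a real "<" in the input.
no_anchor = re.search(r"\<y", "y")
with_bracket = re.search(r"\<y", "<y")
print(no_anchor, with_bracket is not None)
```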

Also note that inside a character class, things differ. In most regex interpreters, [\b] means "literal backspace character" and is equivalent to [\010] or [\x08] (or \010 or \x08). Putting a zero-width item into a character class doesn't make any sense anyway.
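A quick way to see the character-class behaviour is Python's re module, which documents [\b] as a backspace. This is a small sketch with a made-up test string.

```python
import re

text = "ab\x08cd"  # contains exactly one backspace character (U+0008)

in_class = re.findall(r"[\b]", text)  # inside a class: literal backspace
as_hex = re.findall(r"\x08", text)    # the same character written as hex
# Outside a class, \b is zero-width, so collect offsets instead of text.
boundaries = [m.start() for m in re.finditer(r"\b", text)]

print(in_class == as_hex)
print(boundaries)
```

The class version consumes the backspace character itself, while the bare \b only reports positions and consumes nothing.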

An example of the differences, using GNU grep, which accepts both formats:

$ echo yes |grep '\<yes'
yes
$ echo yes |grep '\byes'
yes
$ echo yes |grep '\>yes'
# (no output here means it failed)
$ 

Here you can see that the directionality matters for \< and \> but not for \b.
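The same experiment can be reproduced where GNU grep isn't available by spelling \< and \> as lookarounds. This is a rough sketch in Python; WORD_START and WORD_END are illustrative names, not something any engine defines. The second half adds a case of my own ("a-b") where the anchor sits in front of a non-word character.

```python
import re

WORD_START = r"(?<!\w)(?=\w)"  # stand-in for \<
WORD_END = r"(?<=\w)(?!\w)"    # stand-in for \>

# Mirrors the three grep commands above, in the same order.
start_yes = bool(re.search(WORD_START + "yes", "yes"))
b_yes = bool(re.search(r"\byes", "yes"))
end_yes = bool(re.search(WORD_END + "yes", "yes"))
print(start_yes, b_yes, end_yes)

# In front of a non-word character the roles flip: \b acts like \>,
# and the \< stand-in can never match because no word character follows.
start_dash = bool(re.search(WORD_START + "-", "a-b"))
b_dash = bool(re.search(r"\b-", "a-b"))
end_dash = bool(re.search(WORD_END + "-", "a-b"))
print(start_dash, b_dash, end_dash)
```

In both halves \b succeeds, because it adapts to whichever side holds the word character, while each directional anchor works in only one of the two positions.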


Various support tests, command-line only (Debian Testing as of 2019/11/25 or FreeBSD 11.2 as noted):

$ echo y |grep '\<y'       # GNU grep w/ BRE, Basic Regular Expression
y
$ echo y |grep -E '\<y'    # GNU grep w/ ERE, Extended Regular Expression
y
$ echo y |grep -P '\<y'    # GNU grep w/ libpcre, Perl-Compatible Regular Expression
$ echo y |perl -ne 'print if /\<y/'  # perl proper
$ echo y |sed '/\<y/!d'    # GNU sed with BRE
y
$ echo y |sed -r '/\<y/!d' # GNU sed with ERE
y
$ echo y |sed '/\<y/!d'    # BSD sed with BRE (FreeBSD 11.2)
y
$ echo y |sed -E '/\<y/!d' # BSD sed with ERE (FreeBSD 11.2)
y
$ echo y |gawk '/\<y/'     # GNU awk
y
$ echo y |mawk '/\<y/'     # More POSIX-aligned
$ 

# python test (result printed as an array, in this case empty for no matches)
$ echo y |python -c 'import re,sys; print(re.findall(r"\<y", sys.stdin.read()))'
[]

grep -P (which uses libpcre, not always compiled into grep) does not match because PCRE doesn't recognize \< as anything but a literal < character.

$ echo y |grep '\by'       # GNU grep w/ BRE, Basic regex
y
$ echo y |grep -E '\by'    # GNU grep w/ ERE, Extended regex
y
$ echo y |grep -P '\by'    # GNU grep w/ libpcre, Perl-compatible regex
y
$ echo y |perl -ne 'print if /\by/'  # perl proper 
y
$ echo y |sed '/\by/!d'    # GNU sed with BRE
y
$ echo y |sed -r '/\by/!d' # GNU sed with ERE
y
$ echo y |sed '/\by/!d'    # BSD sed with BRE (FreeBSD 11.2)
$ echo y |sed -E '/\by/!d' # BSD sed with ERE (FreeBSD 11.2)
$ echo y |gawk '/\by/'     # GNU awk
$ echo y |mawk '/\by/'     # POSIX-ish awk
$ 

# python test
$ echo y |python -c 'import re,sys; print(re.findall(r"\by", sys.stdin.read()))'
['y']

Note how BSD sed accepts \< but not \b, yet GNU sed accepts both.

Adam Katz answered Oct 17 '25 09:10
