
Non-greedy matching with grep

Tags: regex, grep, gnu, bsd

Non-greedy matching, as far as I know, is not part of Basic Regular Expressions (BRE) or Extended Regular Expressions (ERE). However, the behaviour of different versions of grep (BSD and GNU) seems to suggest otherwise.

For example, suppose I have the following string:

string="hello_my_dear_polo"

Using GNU grep:

The following are a few attempts to extract hello from the string.

BRE Attempt (fails):

$ grep -o "hel.*\?o" <<< "$string"
hello_my_dear_polo

The output is the entire string, which suggests that the non-greedy quantifier does not work in BRE. Note that I have escaped only ?, since * does not lose its meaning and need not be escaped.

ERE Attempt (fails):

$ grep -oE "hel.*?o" <<< "$string"
hello_my_dear_polo

Enabling the -E option yields the same output, suggesting that non-greedy matching is not part of ERE either. Escaping was not needed here since we are using ERE.

PCRE Attempt (succeeds):

$ grep -oP "hel.*?o" <<< "$string"
hello

Enabling the -P option for PCRE gives the desired output of hello, which suggests that the non-greedy quantifier is part of PCRE. Escaping was not needed here since we are using PCRE.

Using BSD grep:

Here are a few attempts to extract hello from the string.

BRE Attempt (fails):

$ grep -o "hel.*\?o" <<< "$string"

Using BRE I get no output from BSD grep.

ERE Attempt (succeeds):

$ grep -oE "hel.*?o" <<< "$string"
hello

After enabling the -E option, I was surprised to get my desired output. My question is about the output of this attempt.

PCRE Attempt (fails):

$ grep -oP "hel.*?o" <<< "$string"
usage: grep [-abcDEFGHhIiJLlmnOoPqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
    [-e pattern] [-f file] [--binary-files=value] [--color=when]
    [--context[=num]] [--directories=action] [--label] [--line-buffered]
    [--null] [pattern] [file ...]

Using the -P option gave me a usage error, which was expected since the BSD version of grep does not support PCRE.

So my question is: why does ERE on BSD grep yield the correct output with a non-greedy quantifier, while ERE on GNU grep does not?

Is this a bug, an undocumented feature of BSD egrep, or my misunderstanding of the output?

jaypal singh asked May 04 '14 08:05


1 Answer

The double quantifier is simply a syntax error and could result in either an error message or undefined behavior. It would arguably be better if you got an error message.

Perl extensions to regex post-date POSIX by a large margin; at the time these tools were written, it was extremely unlikely that someone would try to use this wacky syntax for anything. Non-greedy matching was only introduced in Perl 5, in the mid-1990s.
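
Since the doubled quantifier cannot be relied on in BRE or ERE, one common workaround is to replace the wildcard with a negated bracket expression, which should behave the same on GNU and BSD grep. The following is only a minimal sketch, assuming the match should end at the first o after hel; it reuses the string from the question, and the result variable is introduced here purely for illustration.

```shell
string="hello_my_dear_polo"

# [^o]* can never consume an "o", so the match must stop at the first "o"
# after "hel". No non-greedy quantifier is involved.
result=$(printf '%s\n' "$string" | grep -o 'hel[^o]*o')

echo "$result"   # prints: hello
```

This only works when the terminator is a single character; for a multi-character terminator, a tool with real non-greedy support (such as grep -P where available, or perl) is likely the simpler choice.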

tripleee answered Oct 16 '22 17:10