
order of word boundaries and anchors in PCRE

Tags: regex, pcre

Are the following pairs of expressions equivalent in PCRE?

  1. For ^ : \b^<some-regex>\b and ^\b<some-regex>\b (e.g.: \b^[a-z]\b and ^\b[a-z]\b)

  2. For $ : \b<some-regex>$\b and \b<some-regex>\b$ (e.g.: \b[a-z]$\b and \b[a-z]\b$)

  3. For the combination of ^ and $ : \b^<some-regex>$\b and ^\b<some-regex>\b$ (e.g.: \b^[a-z]$\b and ^\b[a-z]\b$)

I tested all of the options above and couldn't find any difference in matching. If they're not equivalent, please give an example input that matches one but not the other.

Benny Brudner asked Dec 06 '25 15:12



1 Answer

Anchors and word boundaries are non-consuming (zero-width) patterns. That means the regex index stays at the same position in the string after an anchor or a word boundary has been evaluated.

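In a \b$ pattern, the regex engine first checks that the current position is a word boundary and then, staying at the same position, checks that it is also the end of the string (by default, PCRE's $ also matches just before a final newline).

In a $\b pattern, the checks run in the opposite order: the engine first checks that the current position is the end of the string and then, staying at the same position, checks that it is also a word boundary.

Both assertions are tested at the same position and neither moves the index, so the order does not matter: \b$ is equivalent to $\b.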

The same applies to ^\b and \b^ (where ^ matches the start-of-string position).
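
As a rough empirical check of the reasoning above, the sketch below compares the pattern pairs from the question on every short string over a small alphabet. It uses Python's re module rather than PCRE itself, on the assumption that \b, ^ and $ behave the same way in both engines for these patterns; the helper names (spans, all_inputs, find_difference) are made up for this illustration.

```python
import re
from itertools import product

# Pattern pairs from the question; the two members of each pair differ only
# in the order of the anchor and the adjacent word boundary.
PAIRS = [
    (r"\b^[a-z]\b", r"^\b[a-z]\b"),
    (r"\b[a-z]$\b", r"\b[a-z]\b$"),
    (r"\b^[a-z]$\b", r"^\b[a-z]\b$"),
]


def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


def all_inputs(alphabet="a -\n", max_len=4):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def find_difference(pairs=PAIRS):
    """Return (left, right, text) for the first input on which a pair
    disagrees, or None if the pairs agree on all generated inputs."""
    for left, right in pairs:
        for text in all_inputs():
            if spans(left, text) != spans(right, text):
                return left, right, text
    return None


if __name__ == "__main__":
    # Assumption: Python's re stands in for PCRE here.
    print(find_difference())
```

An exhaustive search like this cannot prove equivalence, but it is consistent with the argument: since both assertions are evaluated at the same position, no input should separate the two orderings.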

You might have heard that lookarounds are non-consuming, and that is true. In fact, \b can be rewritten as the lookaround alternation (?<!\w)(?=\w)|(?<=\w)(?!\w). ^ and $ are trickier to paraphrase, but the key point is that ^ is equivalent to (?=^) and $ is equivalent to (?=$).
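
The lookaround paraphrases above can be spot-checked by comparing the positions at which each form matches. This is only a sketch: it runs on Python's re module (assumed here to agree with PCRE for these constructs), and the sample strings and helper names are chosen for illustration.

```python
import re

# Each pair holds an assertion and its lookaround paraphrase from the answer.
EQUIVALENT_FORMS = [
    (r"\b", r"(?<!\w)(?=\w)|(?<=\w)(?!\w)"),
    (r"^", r"(?=^)"),
    (r"$", r"(?=$)"),
]

# Illustrative inputs, including an empty string and a trailing newline.
SAMPLES = ["", "a", "ab cd", " x ", "foo-bar_baz", "end\n", "1st, 2nd"]


def positions(pattern, text):
    """Return every index in text at which the zero-width pattern matches."""
    return [m.start() for m in re.finditer(pattern, text)]


def forms_agree(forms=EQUIVALENT_FORMS, samples=SAMPLES):
    """True if each assertion and its paraphrase match at the same positions."""
    return all(
        positions(plain, s) == positions(lookaround, s)
        for plain, lookaround in forms
        for s in samples
    )


if __name__ == "__main__":
    print(positions(r"\b", "ab cd"))
    print(forms_agree())
```

Because every pattern here is zero-width, comparing match start positions is enough; no characters are ever consumed.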

Wiktor Stribiżew answered Dec 08 '25 07:12



