Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Verbs that act after backtracking and failure

Tags:

I was recently reading along in the PCRE - (Perl-compatible regular expressions) documentation and came across some interesting tricks with regular expression. As I continued to read and exhaust myself, I stopped because of some confusion in relation with using a few of the (*...) patterns.

My question and confusion relates to (*PRUNE) and (*FAIL)

Now for reference (*SKIP) acts like (*PRUNE), except that if the pattern is unanchored, the bumpalong advance is not to the next character, but to the position in the subject where (*SKIP) was encountered.

The documentation states that (*PRUNE) causes the match to fail at the current starting position in the subject if the rest of the pattern does not match. And it states (*FAIL) synonymous with (?!) negative assertion. Forces a matching failure at the given position in the pattern.

So basically (*FAIL) behaves like a failing negative assertion and is a synonym for (?!)

And (*PRUNE) causes the match to fail at the current starting position in the subject if there is a later matching failure that causes backtracking to reach it.

How are these different when it comes to a point of failing?

Can anyone provide examples of how these are implemented and used correctly?

like image 665
hwnd Avatar asked Nov 15 '13 03:11

hwnd


1 Answers

Before reading this answer, you should be familiar with the mechanism of backtracking, atomic groups, and possessive quantifiers. You can find information about these notions and features in the Friedl book and following these links: www.regular-expressions.info, www.rexegg.com

All the test has been made with a global search (with the preg_match_all() function).

(*FAIL) (or the shorthand (*F))

baabo caaco daado  caac(*FAIL)|aa.|caaco|co  [0] => aab [1] => caaco [2] => aad 

(*FAIL) causes exactly the same behavior as a "bad character" in a pattern. If you replace it with an "R", for example, you obtain exactly the same result: caacR|aa.|caaco|co. To be more general, you can indeed replace (*FAIL) with an "always failing subpattern" like: (?!), (?=a(?<!a)),...

a (first from "baabo") : Without surprise, the first result is found by the second alternative. (aab)

c (first) : The regex engine encounters the first "c" and tries the first alternative and finds: caac, but the subpattern is forced to fail. Then the regex engine (always from the first "c") tries the second alternative that fails, the third alternative succeeds. (caaco)

a (first from "daado") : the third result is found by the second alternative. (aad)

(*SKIP)

baabo caaco daado  caa(*SKIP)c(*FAIL)|aa.|caaco|co  [0] => aab [1] => co [2] => aad 

This verb defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later. Consequently, all the characters found before with the subpattern are consumed, once and for all, and can not be used for another part of the pattern (alternative).

a (first from "baabo") : the first result is found by the second alternative. (aab)

c (first) : the regex engine finds caac as in the first case, then fails (cause of the (*FAIL) verb), backtracks to the second "c" but is not allowed to backtrack to characters that are previously matched ("caa") before the (*SKIP) verb.
c (second) : Now, the regex engine tries always the first alternative but in this new position and fails since there is an "o" and not an "a" after, then it backtracks to this second "c". Note that in this case, these characters are not consumed as previously because the subpattern has failed before to have reached the (*SKIP) verb. The second alternative is tested and fails (doesn't begin with a "c"). The third alternative fails too because the next character is an "o" and not an "a". The fourth alternative succeeds and gives the second result. (co)

a (first from "daado") : the third result is found by the second alternative. (aad)

(*PRUNE)

baabo caaco daado  caa(*PRUNE)c(*FAIL)|aa.|caaco|co  [0] => aab [1] => aac [2] => aad 

This verb is different from (*SKIP) because it doesn't forbid using all the previous matched characters, but skips the first matched character by the subpattern (or forbids a subpattern to start with) if the subpattern will fail later.

a (first from "baabo") : the first result is found by the second alternative. (aab)

c (first) : the regex engine finds caac as in the first case, then fails, but now backtracks to the first "a" from "caaco" since the first "c" is skipped.
a (first from "caaco") : the first alternative is tried and fails, the second succeeds and gives the second result. (aac)

a (first from "daado") : the third result is found by the second alternative. (aad)

like image 131
Casimir et Hippolyte Avatar answered Sep 29 '22 12:09

Casimir et Hippolyte