Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do (*SKIP) or (*F) work on regex?

Tags:

regex

I'm learning an advanced usage of regex and noticed that many posts use (*SKIP) or (*F) in it.

I posted a question where the idea was to match lines that don't have yellow but has blue only if brown exists after blue. And the right answer was:

.*yellow.*(*SKIP)(*F)|^.*\bblue\b(?=.*brown).*$ 

I also have tried lookaround expressions like below but haven't worked for all the cases:

^((?!yellow).)*blue(?=.*brown).*$ 

I had no idea about these (*SKIP)(*F) flags, so the question is, how do these flag works? What do they do? And are there other flags like these?

Thanks.

like image 282
Federico Piazza Avatar asked Jul 02 '14 15:07

Federico Piazza


People also ask

What does * do in regex?

The Match-zero-or-more Operator ( * ) This operator repeats the smallest possible preceding regular expression as many times as necessary (including zero) to match the pattern. `*' represents this operator. For example, `o*' matches any string made up of zero or more `o' s.

What does regex 0 * 1 * 0 * 1 * Mean?

Basically (0+1)* mathes any sequence of ones and zeroes. So, in your example (0+1)*1(0+1)* should match any sequence that has 1. It would not match 000 , but it would match 010 , 1 , 111 etc. (0+1) means 0 OR 1.

What does F mean in regex?

\f stands for form feed, which is a special character used to instruct the printer to start a new page.


1 Answers

These two backtracking control verbs are implemented only in Perl, PCRE and the pypi regex module.

The idea of the (*SKIP)(*FAIL) trick is to consume characters that you want to avoid, and that must not be a part of the match result.

A classical pattern that uses of this trick looks like that:

What_I_want_to_avoid(*SKIP)(*FAIL)|What_I_want_to_match 

A regex engine processes a string like that:

  • the first token of the pattern is tested on each character from left to right (by default most of the time, but some regex engines can be set to work from right to left, .net can do this if I remember well)

  • if the first token matches, then the regex engine tests the next token of the pattern with the next characters (after the first token match) etc.

  • when a token fails, the regex engine gets the characters matched by the last token back and tries another way to make the pattern succeed (if it doesn't work too, the regex engine do the same with the previous token etc.)

When the regex engine meets the (*SKIP) verb (in this case all previous tokens have obviously succeeded), it has no right anymore to go back to all the previous tokens on the left and has no right anymore to retry all the matched characters with another branch of the pattern or at the next position in the string until the last matched character (included) if the pattern fails later on the right of the (*SKIP) verb.

The role of (*FAIL) is to force the pattern to fail. Thus all the characters matched on the left of (*SKIP) are skipped and the regex engine continues its job after these characters.

The only possibility for the pattern to succeed in the example pattern is that the first branch fails before (*SKIP) to allow the second branch to be tested.

You can find another kind of explanation here.

About Java and other regex engines that don't have these two features

Backtracking control verbs are not implemented in other regex engines and there are no equivalent.

However, you can use several ways to do the same (to be more clear, to avoid something that can be possibly matched by an other part of the pattern).

The use of capture groups:

way 1:

What_I_want_to_avoid|(What_I_want_to_match) 

You only need to extract the capture group 1 (or to test if it exists), since it is what you are looking for. If you use the pattern to perform a replacement, you can use the properties of the match result (offset, length, capture group) to make the replacement with classical string functions. Other language like javascript, ruby... allows to use a callback function as replacement.

way 2:

((?>To_avoid|Other_things_that_can_be_before_what_i_want)*)(What_I_want) 

It's the more easy way for the replacement, no need to callback function, the replacement string need only to begin with \1 (or $1)

The use of lookarounds:

example, you want to find a word that is not embedded between two other words (lets say S_word and E_word that are different (see Qtax comment)):

(the edge cases S_word E_word word E_word and S_word word S_word E_word are allowed in this example.)

The backtracking control verb way will be:

S_word not_S_word_or_E_word E_word(*SKIP)(*F)|word 

To use this way the regex engine needs to allow variable length lookbehinds to a certain extent. With .net or the new regex module, no problems, lookbehinds can have a totally variable length. It is possible with Java too but the size must be limited (example: (?<=.{1,1000})).

The Java equivalent will be:

word(?:(?!not_S_word_or_E_word E_word)|(?<!S_word not_E_word{0,1000} word)) 

Note that in some cases, only the lookahead is necessary. Note too that starting a pattern with literal character is more efficient than starting with a lookbehind, that's why I putted it after the word (even if I need to rewrite the word one more time in the assertion.)

like image 175
Casimir et Hippolyte Avatar answered Sep 29 '22 19:09

Casimir et Hippolyte