In my home directory I have a folder drupal-6.14 that contains the Drupal platform.
From this directory I use the following command:
find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz
This command creates a gzipped tar archive of the folder drupal-6.14, excluding every subfolder of drupal-6.14/sites/ except sites/all and sites/default, which it includes.
My question is on the regular expression:
grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*'
The expression works to exclude all the folders I want excluded, but I don't quite understand why.
It is a common task using regular expressions to:
Match all strings, except those that contain subpattern x. In other words, negate a subpattern.
I think I understand that the general strategy for these problems is to use negative lookaheads, but I've never understood to a satisfactory level how positive and negative lookaheads and lookbehinds work.
Over the years, I've read many websites on them: the PHP and Python regex manuals, other pages like http://www.regular-expressions.info/lookaround.html and so forth, but I've never really had a solid understanding of them.
Could someone explain how this is working, and perhaps provide some similar examples?
-- Update One:
Regarding Andomar's response: can a double negative lookahead be more succinctly expressed as a single positive lookahead statement:
i.e., is:
'drupal-6.14/(?!sites(?!/all|/default)).*'
equivalent to:
'drupal-6.14/(?=sites(?:/all|/default)).*'
???
-- Update Two:
As per @andomar and @alan moore: you can't substitute a positive lookahead for a double negative lookahead.
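To see why the two patterns differ, here is a minimal Python sketch. The two regexes are copied from the question; the sample paths are made up for illustration, and Python's re module stands in for grep -P on the assumption that both treat these lookaheads the same way.

```python
import re

# Patterns copied from the question.
double_negative = re.compile(r'drupal-6.14/(?!sites(?!/all|/default)).*')
positive = re.compile(r'drupal-6.14/(?=sites(?:/all|/default)).*')

# Hypothetical file paths, invented for this illustration.
paths = [
    'drupal-6.14/index.php',
    'drupal-6.14/sites/all/modules/foo.module',
    'drupal-6.14/sites/default/settings.php',
    'drupal-6.14/sites/example.com/settings.php',
]

kept_by_double_negative = [p for p in paths if double_negative.search(p)]
kept_by_positive = [p for p in paths if positive.search(p)]

print(kept_by_double_negative)  # everything except sites/example.com
print(kept_by_positive)         # only the sites/all and sites/default paths
```

The double negative keeps any path that does not start with sites at all (such as index.php), while the positive lookahead requires the path to start with sites/all or sites/default, so it drops the rest of the platform.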
In a negative lookahead, the regex engine checks whether a particular element (a character, a sequence of characters, or a group) follows the item just matched. If that element is not present, the engine accepts the match; otherwise it rejects it.
Positive and Negative Lookahead
To match a q that is not followed by a u, negative lookahead provides the solution: q(?!u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.
A lookahead is an assertion in Python regular expressions that succeeds or fails depending on whether a pattern appears ahead, i.e. to the right, of the parser's current position. Lookaheads do not consume any characters, which is why they are called zero-width assertions.
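The zero-width behaviour can be checked with a short Python sketch. The sample strings 'Iraq war' and 'quit' are chosen here only for illustration.

```python
import re

# Negative lookahead: a q that is NOT followed by u.
no_u = re.search(r'q(?!u)', 'Iraq war')
print(no_u.group(), no_u.span())  # only the q itself is part of the match

# The same pattern finds nothing when every q is followed by u.
print(re.search(r'q(?!u)', 'quit'))

# Positive lookahead: a q that IS followed by u. The u is checked but not consumed.
followed = re.search(r'q(?=u)', 'quit')
print(followed.group(), followed.span())
```

In both successful cases the match is the single character q: the lookahead only tests what comes next and adds nothing to the matched text.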
The syntax is X(?=Y), which means "look for X, but match only if followed by Y". Any pattern may be used in place of X and Y. For an integer number followed by €, the regexp will be \d+(?=
A negative lookahead says: at this position, the following regex cannot match.
Let's take a simplified example:
a(?!b(?!c))

a      Match: (?!b) succeeds
ac     Match: (?!b) succeeds
ab     No match: (?!b(?!c)) fails
abe    No match: (?!b(?!c)) fails
abc    Match: (?!b(?!c)) succeeds
The last example is a double negation: it allows b followed by c. The nested negative lookahead becomes a positive lookahead: the c should be present.
In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.
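The simplified example can be reproduced with a few lines of Python. This is only a sketch: the helper name matched_text is made up for illustration, and Python's re module is assumed to treat nested lookaheads the same way as the PCRE engine behind grep -P.

```python
import re

pattern = re.compile(r'a(?!b(?!c))')

def matched_text(s):
    """Return the text matched in s, or None when nothing matches."""
    m = pattern.search(s)
    return m.group() if m else None

samples = ['a', 'ac', 'ab', 'abe', 'abc']
results = {s: matched_text(s) for s in samples}

for s, text in results.items():
    print(f'{s!r:8} -> {text!r}')
```

The strings a, ac and abc each yield the one-character match a, while ab and abe yield None: the b is tolerated only when a c follows it.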