Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Regular expression negative lookahead

In my home directory I have a folder drupal-6.14 that contains the Drupal platform.

From this directory I use the following command:

find drupal-6.14 -type f -iname '*' | grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' | xargs tar -czf drupal-6.14.tar.gz 

What this command does is gzips the folder drupal-6.14, excluding all subfolders of drupal-6.14/sites/ except sites/all and sites/default, which it includes.

My question is on the regular expression:

grep -P 'drupal-6.14/(?!sites(?!/all|/default)).*' 

The expression works to exclude all the folders I want excluded, but I don't quite understand why.

It is a common task using regular expressions to

Match all strings, except those that don't contain subpattern x. Or in other words, negating a subpattern.

I (think) I understand that the general strategy to solve these problems is the use of negative lookaheads, but I've never understood to a satisfactory level how positive and negative look(ahead/behind)s work.

Over the years, I've read many websites on them. The PHP and Python regex manuals, other pages like http://www.regular-expressions.info/lookaround.html and so forth, but I've never really had a solid understanding of them.

Could someone explain, how this is working, and perhaps provide some similar examples that would do similar things?

-- Update One:

Regarding Andomar's response: can a double negative lookahead be more succinctly expressed as a single positive lookahead statement:

i.e Is:

'drupal-6.14/(?!sites(?!/all|/default)).*' 

equivalent to:

'drupal-6.14/(?=sites(?:/all|/default)).*' 

???

-- Update Two:

As per @andomar and @alan moore - you can't interchange double negative lookahead for positive lookahead.

like image 950
themesandmodules Avatar asked Nov 17 '09 14:11

themesandmodules


People also ask

What is a negative lookahead in regex?

In this type of lookahead the regex engine searches for a particular element which may be a character or characters or a group after the item matched. If that particular element is not present then the regex declares the match as a match otherwise it simply rejects that match.

How do you use negative look ahead?

Positive and Negative Lookahead Negative lookahead provides the solution: q(?! u). The negative lookahead construct is the pair of parentheses, with the opening parenthesis followed by a question mark and an exclamation point. Inside the lookahead, we have the trivial regex u.

What is regex lookahead?

Lookahead is used as an assertion in Python regular expressions to determine success or failure whether the pattern is ahead i.e to the right of the parser's current position. They don't match anything. Hence, they are called as zero-width assertions.

What is lookahead in regex JavaScript?

The syntax is: X(?= Y) , it means "look for X , but match only if followed by Y ". There may be any pattern instead of X and Y . For an integer number followed by € , the regexp will be \d+(?=


1 Answers

A negative lookahead says, at this position, the following regex can not match.

Let's take a simplified example:

a(?!b(?!c))  a      Match: (?!b) succeeds ac     Match: (?!b) succeeds ab     No match: (?!b(?!c)) fails abe    No match: (?!b(?!c)) fails abc    Match: (?!b(?!c)) succeeds 

The last example is a double negation: it allows b followed by c. The nested negative lookahead becomes a positive lookahead: the c should be present.

In each example, only the a is matched. The lookahead is only a condition, and does not add to the matched text.

like image 50
Andomar Avatar answered Sep 17 '22 15:09

Andomar