Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Understanding negative look aheads in regular expressions

Tags:

regex

ruby

I want to match urls that do NOT contain the string 'localhost' using Ruby regex

Based on answers and comments here, I put together two solutions, both of which seem to work:

Solution A:

(?!.*localhost)^.*$ 

Example: http://rubular.com/r/tQtbWacl3g

Solution B:

^((?!localhost).)*$ 

Example: http://rubular.com/r/2KKnQZUMwf

The problem is that I don't understand what they're doing. For example, according to the docs, ^ can be used in various ways:

[^abc]  Any single character except: a, b, or c  
^ Start of line  

But I don't get how it's being applied here.

Can someone breakdown these expressions for me, and how they differ from one another?

like image 700
Yarin Avatar asked Dec 11 '22 12:12

Yarin


2 Answers

In both of your cases, ^ is just the start of the line (since it's not used inside a character class). Since both ^ and the lookahead are zero-width assertions, we can switch them around in the first case - I think that makes it a bit easier to explain:

^(?!.*localhost).*$ 

The ^ anchors the expression to the beginning of the string. The lookahead then starts from that position and tries to find localhost anywhere the string (the "anywhere" is taken care of by the .* in front of localhost). If that localhost can be found, the subexpression of the lookahead matches and therefore the negative lookahead causes the pattern to fail. Since the lookahead is bound to start at the beginning of the string by the adjacent ^ this means, the pattern overall cannot match. If, however the .*localhost does not match (and hence localhost does not occur in the string), the lookahead succeeds, and the .*$ simply takes care of matching the rest of the string.

Now the other one

^((?!localhost).)*$

This time the lookahead only checks at the current position (there is no .* inside it). But the lookahead is repeated for every single character. This way it does check every single position again. Here is roughly what happens: the ^ makes sure that we're starting at the beginning of the string again. The lookahead checks whether the word localhost is found at that position. If not, all is well, and . consumes one character. The * then repeats both of those steps. We are now one character further in the string, and the lookahead checks whether the second character starts the word localhost - again, if not, all is well, and . consumes another character. This is done for every single character in the string, until we reach the end.

In this particular case both methods are equivalent, and you could select one based on performance (if it matters) or readability (if not; probably the first one). However, in other cases the second variant is preferable, because it allows you to do this repetition for a fixed part of the string, whereas the first variant will always check the entire string.

like image 82
Martin Ender Avatar answered Feb 24 '23 21:02

Martin Ender


You can get them easily explained online. The first:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
    .*                       any character except \n (0 or more times
                             (matching the most amount possible))
--------------------------------------------------------------------------------
    localhost                'localhost'
--------------------------------------------------------------------------------
  )                        end of look-ahead
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
--------------------------------------------------------------------------------
                           ' '

And the second:

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  ^                        the beginning of the string
--------------------------------------------------------------------------------
  (                        group and capture to \1 (0 or more times
                           (matching the most amount possible)):
--------------------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
      localhost                'localhost'
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
    .                        any character except \n
--------------------------------------------------------------------------------
  )*                       end of \1 (NOTE: because you are using a
                           quantifier on this capture, only the LAST
                           repetition of the captured pattern will be
                           stored in \1)
--------------------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
--------------------------------------------------------------------------------
like image 22
Carl Norum Avatar answered Feb 24 '23 19:02

Carl Norum