Perl: Matching string not containing PATTERN

Tags: regex, perl

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:

/^(?:(?!PATTERN).)*$/;    # Matches strings not containing PATTERN

Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.

The question "Regular expression to match a line that doesn't contain a word?" helps a lot in understanding, but why are the . and the ?: in my example, and how do the outer parentheses work?

Can someone break up the regex and explain in simple words how it works?

jippie asked May 01 '14 07:05


2 Answers

Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):

This matches any string:

/^.*$/

But we don't want . to match a character that starts PATTERN, so replace

.

with

(?!PATTERN).

This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:

if PATTERN doesn't match at this point,
    match the next character

This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.

To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):

(?:(?!PATTERN).)*

And putting back the anchors, so that the match has to span the whole string and the test therefore happens at every position:

/^(?:(?!PATTERN).)*$/
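
As a minimal sketch of the finished regex in use, the snippet below substitutes the literal word bar for PATTERN. The helper name lacks_bar and the sample strings are made up for illustration, and the strings are assumed to contain no newlines, as stated at the top of this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $str does not contain "bar", else 0.
# "bar" stands in for PATTERN; newlines are assumed absent.
sub lacks_bar {
    my ($str) = @_;
    return $str =~ /^(?:(?!bar).)*$/ ? 1 : 0;
}

for my $str ('foo baz', 'foo bar baz', '', 'ba r') {
    printf "%-15s %s\n", "'$str'", lacks_bar($str) ? 'match' : 'no match';
}
```

Of these samples, only 'foo bar baz' is rejected. The empty string matches because the group is allowed to repeat zero times.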

Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:

/foo(?:(?!bar).)*baz/
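
A short sketch of that larger match, run against a few invented sample strings. The variable names are illustrative only.

```perl
use strict;
use warnings;

# foo, then a run of characters in which "bar" never starts, then baz.
my $re = qr/foo(?:(?!bar).)*baz/;

my @samples = (
    'foo and baz',        # no bar at all
    'foo bar baz',        # bar sits between foo and baz
    'bar foo baz',        # bar lies outside the foo...baz span
    'foo bar foo baz',    # the second foo starts a bar-free span
);

for my $s (@samples) {
    printf "%-17s %s\n", $s, $s =~ $re ? 'match' : 'no match';
}
```

Because this regex is not anchored, the engine may start at any foo in the string. That is why the last sample still matches: the span from the second foo to baz contains no bar.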

If there aren't such considerations, you can simply do:

/^(?!.*PATTERN)/

to check that PATTERN does not match anywhere in the string.
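
The following sketch compares the two forms on a handful of made-up, newline-free strings, again with bar standing in for PATTERN. It only illustrates that they agree on these inputs; it is not a proof of equivalence.

```perl
use strict;
use warnings;

my $stepwise  = qr/^(?:(?!bar).)*$/;   # checks for bar before each character
my $lookahead = qr/^(?!.*bar)/;        # one look-ahead from the start of the string

my @agree;
for my $s ('foo baz', 'foo bar baz', 'bar', '') {
    my $r1 = $s =~ $stepwise  ? 1 : 0;
    my $r2 = $s =~ $lookahead ? 1 : 0;
    push @agree, $r1 == $r2 ? 1 : 0;
    print "'$s': stepwise=$r1 lookahead=$r2\n";
}
```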

About newlines: there are two problems with your regex and newlines.

First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches.

Second, $ doesn't match only at the end of the string; it can also match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are trying to match only strings with no whitespace. \z matches only at the very end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string, which does in fact contain a \s.

So the correct general-purpose regex is:

/^(?:(?!PATTERN).)*\z/s
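
The four newline cases quoted above can be checked with a small script. This is only a sketch: the labels and the @results array are added for illustration, while the strings and regexes are the ones from the text.

```perl
use strict;
use warnings;

# Each case: [ label, string, regex ].
my @cases = (
    [ 'no /s, $ anchor', "foo\nbar", qr/^(?:(?!baz).)*$/  ],
    [ '/s, $ anchor',    "foo\nbar", qr/^(?:(?!baz).)*$/s ],
    [ '/s, $ anchor',    "foo\n",    qr/^(?:(?!\s).)*$/s  ],
    [ '/s, \z anchor',   "foo\n",    qr/^(?:(?!\s).)*\z/s ],
);

my @results;
for my $case (@cases) {
    my ($label, $str, $re) = @$case;
    my $matched = $str =~ $re ? 1 : 0;
    push @results, $matched;
    (my $shown = $str) =~ s/\n/\\n/g;   # make the newline visible in the output
    printf "%-16s %-12s %s\n", $label, "\"$shown\"", $matched ? 'match' : 'no match';
}
```

The first and last cases report no match and the middle two report a match, which is the behaviour described above. The third case is the misleading one, since the string does contain whitespace.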
ysth answered Sep 24 '22 21:09


jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.

For instance, here is the RegexBuddy output:

"
^                # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?:              # Match the regular expression below
   (?!              # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      PATTERN          # Match the character string “PATTERN” literally (case insensitive)
   )
   .                # Match any single character that is NOT a line break character (line feed)
)
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$                # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
                    # Perl 5.18 allows a zero-length match at the position where the previous match ends.
                    # Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"

Some people also use regex101.
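
Perl can also carry this kind of token-by-token commentary inside the regex itself by way of the /x flag, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The sketch below is a hand-written annotation, not output from either tool, and the sample strings are made up.

```perl
use strict;
use warnings;

my $re = qr/
    ^                # start of the string
    (?:              # group without capturing
        (?!PATTERN)  #   PATTERN must not start at this position
        .            #   then consume one character, but not a newline
    )*               # repeat the group zero or more times
    $                # end of the string, or just before a final newline
/x;

for my $s ('plain text', 'has PATTERN inside') {
    printf "%-20s %s\n", $s, $s =~ $re ? 'match' : 'no match';
}
```

The comments deliberately avoid the / delimiter and the $ and @ sigils, since Perl finds the end of the pattern and interpolates variables before the comments are discarded.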

A Human Explanation

Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.

Then we come to the meat: a non-capturing group, introduced by (?: and repeated any number of times by the * quantifier.

What does this group do? It contains

  1. a negative lookahead (you may want to read up on lookarounds) asserting that at this exact position in the string, we cannot match the word PATTERN,
  2. then a dot to match the next character

This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.

If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
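
To make the position-by-position reading concrete, here is a rough sketch that spells the same logic out as an explicit loop. The helper lacks_pattern is hypothetical, it assumes a simple pattern with no lookbehind, and it mirrors the stricter \z and /s form of the regex rather than the original $ form. The regex engine does not literally work this way; the loop only expresses the same test in procedural terms.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $pattern matches nowhere in $str, else 0.
sub lacks_pattern {
    my ($str, $pattern) = @_;
    for my $pos (0 .. length($str) - 1) {
        # The negative lookahead: does $pattern match starting exactly here?
        return 0 if substr($str, $pos) =~ /\A(?:$pattern)/;
        # If not, the dot consumes this character and the loop moves on.
    }
    return 1;   # reached the end of the string without $pattern matching
}

for my $s ('foo baz', 'foo bar baz', '') {
    printf "%-15s %s\n", "'$s'", lacks_pattern($s, qr/bar/) ? 'match' : 'no match';
}
```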

zx81 answered Sep 23 '22 21:09