Perl: Matching string not containing PATTERN

Tags:

perl

While using Perl regex to chop a string down into usable pieces I had the need to match everything except a certain pattern. I solved it after I found this hint on Perl Monks:

/^(?:(?!PATTERN).)*$/;    # Matches strings not containing PATTERN

Although I solved my initial problem, I have little clue about how it actually works. I checked perlre, but it is a bit too formal to grasp.

The question "Regular expression to match a line that doesn't contain a word?" helps a lot in understanding, but why are the . and the ?: in my example, and how do the outer parentheses work?

Can someone break up the regex and explain in simple words how it works?

237

asked May 01 '14 07:05

2 Answers

Building it up piece by piece (and throughout assuming no newlines in the string or PATTERN):

This matches any string:

/^.*$/

But we don't want . to match a character that starts PATTERN, so replace

.

with

(?!PATTERN).

This uses a negative look-ahead that tests a given pattern without actually consuming any of the string and only succeeds if the pattern does not match at the given point in the string. So it's like saying:

if PATTERN doesn't match at this point,
    match the next character
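
A small check of that behaviour, as a sketch only: the variable names are made up for illustration, and bar stands in for PATTERN. The first test shows the look-ahead on its own matching zero characters; the other two show it deciding whether the following dot may consume a character.

```perl
use strict;
use warnings;

# The look-ahead alone consumes nothing: end offset minus start offset is 0.
my $zero_width = 'foo' =~ /^(?!bar)/ ? $+[0] - $-[0] : -1;

# With a dot after it, one character is consumed,
# but only if "bar" does not start at this position.
my ($first) = 'foo' =~ /^((?!bar).)/;
my $blocked = 'bar' =~ /^(?!bar)./ ? 1 : 0;

print "length of look-ahead match: $zero_width\n";
print "character consumed: $first\n";
print "matched at the start of 'bar': $blocked\n";
```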

This needs to be done for every character in the string, so * is used to match zero or more times, from the beginning to the end of the string.

To make the * apply to the combination of the negative look-ahead and ., not just the ., it needs to be surrounded by parentheses, and since there's no reason to capture, they should be non-capturing parentheses (?: ):

(?:(?!PATTERN).)*

And putting back the anchors to make sure we test at every position in the string:

/^(?:(?!PATTERN).)*$/
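
To try the finished regex, here is a minimal sketch that filters a list of single-line strings. The sample strings and the names $no_bar, @lines and @kept are assumptions for illustration, with bar standing in for PATTERN.

```perl
use strict;
use warnings;

# PATTERN is "bar" here; qr// compiles the regex once for reuse.
my $no_bar = qr/^(?:(?!bar).)*$/;

my @lines = ('hello world', 'hello bar world', 'barn owl', 'ba r', '');

# Keep only the strings in which "bar" starts at no position.
my @kept = grep { $_ =~ $no_bar } @lines;

print "kept: '$_'\n" for @kept;
```

Note that 'barn owl' is rejected because bar occurs inside barn, and that the empty string is kept because the group is allowed to repeat zero times.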

Note that this solution is particularly useful as part of a larger match; e.g. to match any string with foo and later baz but no bar in between:

/foo(?:(?!bar).)*baz/
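
A rough sketch of that larger match in use. The helper name foo_then_baz and the test strings are invented for illustration; the capturing parentheses around the whole regex are added only so the matched text can be returned.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the matched text, or undef when nothing matches.
sub foo_then_baz {
    my ($str) = @_;
    return $str =~ /(foo(?:(?!bar).)*baz)/ ? $1 : undef;
}

for my $s ('foo-baz', 'foo bar baz', 'foo bar baz foo qux baz') {
    my $m = foo_then_baz($s);
    printf "%-26s %s\n", "'$s'", defined $m ? "matched '$m'" : 'no match';
}
```

In the third string the attempt starting at the first foo fails because bar lies in between, so the regex engine moves on and matches from the second foo instead.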

If there aren't such considerations, you can simply do:

/^(?!.*PATTERN)/

to check that PATTERN does not match anywhere in the string.
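
As a quick sanity check, the sketch below runs both forms against the same single-line strings. The helper names lacks_stepwise and lacks_single_lookahead are made up for this example, and \Q...\E is used on the assumption that the word should be taken literally.

```perl
use strict;
use warnings;

# Hypothetical helper: the per-character form built up above.
sub lacks_stepwise {
    my ($str, $word) = @_;
    return $str =~ /^(?:(?!\Q$word\E).)*$/ ? 1 : 0;
}

# Hypothetical helper: the single look-ahead form.
sub lacks_single_lookahead {
    my ($str, $word) = @_;
    return $str =~ /^(?!.*\Q$word\E)/ ? 1 : 0;
}

for my $s ('no match here', 'a bar in the middle', 'bar', '') {
    printf "%-22s stepwise=%d single=%d\n", "'$s'",
        lacks_stepwise($s, 'bar'), lacks_single_lookahead($s, 'bar');
}
```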

About newlines: there are two problems with your regex and newlines.

First, . doesn't match newlines, so "foo\nbar" =~ /^(?:(?!baz).)*$/ doesn't match, even though the string does not contain baz. You need to add the /s flag to make . match any character; "foo\nbar" =~ /^(?:(?!baz).)*$/s correctly matches.

Second, $ doesn't match only at the end of the string; it can also match before a newline at the end of the string. So "foo\n" =~ /^(?:(?!\s).)*$/s does match, even though the string contains whitespace and you are attempting to match only strings with no whitespace. \z matches only at the very end, so "foo\n" =~ /^(?:(?!\s).)*\z/s correctly fails to match the string that does in fact contain a \s.

So the correct general-purpose regex is:

/^(?:(?!PATTERN).)*\z/s
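
The four newline cases can be checked directly. This is just a sketch; the %result hash and its keys are labels invented for this example.

```perl
use strict;
use warnings;

# Each entry records whether the match succeeded (1) or failed (0).
my %result = (
    'dot without /s' => ("foo\nbar" =~ /^(?:(?!baz).)*$/  ? 1 : 0),
    'dot with /s'    => ("foo\nbar" =~ /^(?:(?!baz).)*$/s ? 1 : 0),
    'ending in $'    => ("foo\n" =~ /^(?:(?!\s).)*$/s  ? 1 : 0),
    'ending in \z'   => ("foo\n" =~ /^(?:(?!\s).)*\z/s ? 1 : 0),
);

printf "%-16s %s\n", $_, $result{$_} ? 'matches' : 'does not match'
    for sort keys %result;
```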

149

answered Sep 24 '22 21:09

ysth

jippie, first, here's a tip. If you see a regex that is not immediately obvious to you, you can dump it in a tool that explains every token.

For instance, here is the RegexBuddy output:

"
^                # Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed)
(?:              # Match the regular expression below
   (?!              # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      PATTERN          # Match the character string “PATTERN” literally (case insensitive)
   )
   .                # Match any single character that is NOT a line break character (line feed)
)
   *                # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\$                # Assert position at the end of a line (at the end of the string or before a line break character) (line feed)
                    # Perl 5.18 allows a zero-length match at the position where the previous match ends.
                    # Perl 5.18 attempts the next match at the same position as the previous match if it was zero-length and may find a non-zero-length match at the same position.
"

Some people also use regex101.

A Human Explanation

Now if I had to explain the regex, I would not be so linear. I would start by saying that it is fully anchored by the ^ and the $, implying that the only possible match is the whole string, not a substring of that string.

Then we come to the meat: a non-capturing group introduced by (?: and repeated any number of times by the * quantifier.

What does this group do? It contains

a negative lookahead (you may want to read up on lookarounds) asserting that at this exact position in the string, we cannot match the word PATTERN,
then a dot to match the next character

This means that at each position in the string, we assert that we cannot match PATTERN, then we match the next character.

If PATTERN can be matched anywhere, the negative lookahead fails, and so does the entire regex.
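
To make that walk concrete, here is a loose hand-written imitation of what the regex does, not how the regex engine is actually implemented. The function name lacks_by_walking is invented for this sketch, and it assumes a literal word and a string without newlines.

```perl
use strict;
use warnings;

# Hypothetical helper that walks the string the way the regex does.
sub lacks_by_walking {
    my ($str, $word) = @_;
    for my $pos (0 .. length($str) - 1) {
        # The negative lookahead: fail if $word starts at this position.
        return 0 if substr($str, $pos, length $word) eq $word;
        # The dot: stepping to the next position consumes one character.
    }
    # Reached the end without finding $word: the whole string matched.
    return 1;
}

for my $s ('hello world', 'hello bar', 'ba', '') {
    printf "%-14s %s\n", "'$s'", lacks_by_walking($s, 'bar') ? 'match' : 'no match';
}
```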

answered Sep 23 '22 21:09

zx81
