Extending regular expression syntax to say 'does not contain text XYZ'

I have an app where users can specify regular expressions in a number of places. These are used while running the app to check if text (e.g. URLs and HTML) matches the regexes. Often the users want to be able to say where the text matches ABC and does not match XYZ. To make it easy for them to do this I am thinking of extending regular expression syntax within my app with a way to say 'and does not contain pattern'. Any suggestions on a good way to do this?

My app is written in C# .NET 3.5.

My plan (before I got the awesome answers to this question...)

Currently I'm thinking of using the ¬ character: anything before the ¬ character is a normal regular expression, and anything after the ¬ character is a regular expression that must not match anywhere in the text being tested.

So I might use some regexes like this (contrived) example:

on (this|that|these) day(s)?¬(every|all) day(s)?

Which for example would match 'on this day the man said...' but would not match 'on this day and every day after there will be ...'.

In my code that processes the regex I'll simply split out the two parts of the regex and process them separately, e.g.:

    public bool IsMatchExtended(string textToTest, string extendedRegex)
    {
        int notPosition = extendedRegex.IndexOf('¬');

        // Just a normal regex:
        if (notPosition==-1)
            return Regex.IsMatch(textToTest, extendedRegex);

        // Use a positive (normal) regex and a negative one
        string positiveRegex = extendedRegex.Substring(0, notPosition);
        string negativeRegex = extendedRegex.Substring(notPosition + 1);

        return Regex.IsMatch(textToTest, positiveRegex) && !Regex.IsMatch(textToTest, negativeRegex);
    }

Any suggestions on a better way to implement such an extension? I'd need to be slightly cleverer about splitting the string on the ¬ character to allow for it to be escaped, so wouldn't just use the simple Substring() splitting above. Anything else to consider?
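As a rough sketch of the "slightly cleverer" splitting, the scan below walks the string once and treats a backslash followed by ¬ as a literal ¬. The `\¬` escape convention and the names `ExtendedRegex` and `Split` are assumptions made for illustration, not part of the original code; every other escape sequence is passed through to the regex engine untouched.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class ExtendedRegex
{
    private const char Not = '¬';

    // Splits "positive¬negative" at the first unescaped ¬.
    // negative is null when there is no unescaped ¬.
    // Assumed convention (hypothetical): "\¬" stands for a literal ¬.
    public static void Split(string extendedRegex, out string positive, out string negative)
    {
        StringBuilder current = new StringBuilder();
        positive = null;
        negative = null;

        for (int i = 0; i < extendedRegex.Length; i++)
        {
            char c = extendedRegex[i];
            if (c == '\\' && i + 1 < extendedRegex.Length)
            {
                char next = extendedRegex[i + 1];
                // "\¬" becomes a literal ¬; any other escape (e.g. "\\", "\d") is copied unchanged.
                if (next == Not) current.Append(Not);
                else current.Append(c).Append(next);
                i++; // the escaped character has been consumed
            }
            else if (c == Not && positive == null)
            {
                // First unescaped ¬: everything so far is the positive part.
                positive = current.ToString();
                current = new StringBuilder();
            }
            else
            {
                current.Append(c);
            }
        }

        if (positive == null) positive = current.ToString();
        else negative = current.ToString();
    }

    public static bool IsMatchExtended(string textToTest, string extendedRegex)
    {
        string positive, negative;
        Split(extendedRegex, out positive, out negative);

        // An empty negative part is treated as "no negative condition".
        return Regex.IsMatch(textToTest, positive)
            && (string.IsNullOrEmpty(negative) || !Regex.IsMatch(textToTest, negative));
    }
}
```

With this sketch, `ExtendedRegex.IsMatchExtended(text, @"price\¬tag¬sold out")` would look for the literal text `price¬tag` and reject any text containing `sold out`. Only the first unescaped ¬ acts as a separator; later ones end up in the negative pattern as literal characters.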

Alternative plan

In writing this question I also came across this answer which suggests using something like this:

^(?=(?:(?!negative pattern).)*$).*?positive pattern

So, instead of my original plan, I could just advise people to use a pattern like this when they want to NOT match certain text.

Would that do the equivalent of my original plan? I think it's quite an expensive way to do it performance-wise, and since I'm sometimes parsing large HTML documents this might be an issue, whereas I suppose my original plan would be more performant. Any thoughts (besides the obvious: 'try both and measure them!')?

Possibly pertinent for performance: sometimes there will be several 'words' or a more complex regex that can not be in the text, like (every|all) in my example above but with a few more variations.
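One way to check whether a single pattern behaves like the two-regex plan is to build it mechanically from the two halves and compare the results on sample texts. The sketch below is only an illustration: `CombinedRegex`, `Combine` and `TwoStep` are hypothetical names, it uses the simpler `(?!...)` form rather than the `(?:(?!...).)*$` form quoted above, and it uses `[\s\S]` instead of `.` so that no RegexOptions need to change. It is not an exact equivalent in every case; for example, numbered backreferences inside the negative pattern would be renumbered.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CombinedRegex
{
    // Builds one pattern meaning "contains positive AND does not contain negative".
    // Both halves are wrapped in non-capturing groups so alternations stay contained.
    public static string Combine(string positive, string negative)
    {
        return @"\A(?=[\s\S]*?(?:" + positive + @"))"
             + @"(?![\s\S]*?(?:" + negative + @"))";
    }

    // The original two-regex plan, for comparison.
    public static bool TwoStep(string text, string positive, string negative)
    {
        return Regex.IsMatch(text, positive) && !Regex.IsMatch(text, negative);
    }

    // True when both approaches give the same answer for this text.
    public static bool Agree(string text, string positive, string negative)
    {
        bool single = Regex.IsMatch(text, Combine(positive, negative));
        return single == TwoStep(text, positive, negative);
    }
}
```

Running `Agree` over a representative set of real pages and patterns, with a Stopwatch around each approach, would answer the performance question for the actual workload rather than in the abstract.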

Why!?

I know my original approach seems weird, e.g. why not just have two regexes!? But in my particular application administrators provide the regular expressions and it would be rather difficult to give them the ability to provide two regular expressions everywhere they can currently provide one. Much easier in this case to have a syntax for NOT - just trust me on that point.

I have an app that lets administrators define regular expressions at various configuration points. The regular expressions are just used to check if text or URLs match a certain pattern; replacements aren't made and capture groups aren't used. However, often they would like to specify a pattern that says 'where ABC is not in the text'. It's notoriously difficult to do NOT matching in regular expressions, so the usual way is to have two regular expressions: one to specify a pattern that must be matched and one to specify a pattern that must not be matched. If the first is matched and the second is not then the text does match. In my application it would be a lot of work to add the ability to have a second regular expression at each place users can provide one now, so I would like to extend regular expression syntax with a way to say 'and does not contain pattern'.

asked May 03 '11 11:05 by Rory



2 Answers

You don't need to introduce a new symbol. There already is support for what you need in most regex engines. It's just a matter of learning it and applying it.

You have concerns about performance, but have you tested it? Have you measured and demonstrated those performance problems? It will probably be just fine.

Regex works for many many people, in many many different scenarios. It probably fits your requirements, too.

Also, the complicated regex you found on the other SO question can be simplified. There are simple expressions for negative and positive lookaheads and lookbehinds:
?! (negative lookahead), ?<! (negative lookbehind), ?= (positive lookahead), ?<= (positive lookbehind)


Some examples

Suppose the sample text is <tr valign='top'><td>Albatross</td></tr>

Given the following regexes, these are the results you will see:

  1. tr - match
  2. td - match
  3. ^td - no match
  4. ^tr - no match
  5. ^<tr - match
  6. ^<tr>.*</tr> - no match
  7. ^<tr.*>.*</tr> - match
  8. ^<tr.*>.*</tr>(?<=tr>) - match
  9. ^<tr.*>.*</tr>(?<!tr>) - no match
  10. ^<tr.*>.*</tr>(?<!Albatross) - match
  11. ^<tr.*>.*</tr>(?<!.*Albatross.*) - no match
  12. ^(?!.*Albatross.*)<tr.*>.*</tr> - no match

Explanations

The first two match because the regex can apply anywhere in the sample (or test) string. The next two do not match, because the ^ says "start at the beginning", and the test string does not begin with td or tr; it starts with a left angle bracket.

The fifth example matches because the test string starts with <tr. The sixth does not, because it wants the sample string to begin with <tr>, with a closing angle bracket immediately following the tr, but in the actual test string, the opening tr includes the valign attribute, so what follows tr is a space. The 7th regex shows how to allow the space and the attribute with wildcards.

The 8th regex applies a positive lookbehind assertion to the end of the regex, using ?<= (so the assertion is written (?<=tr>)). It says: match the entire regex only if what immediately precedes the cursor in the test string matches what's in the parens following the ?<=, which in this case is tr>. After evaluating ^<tr.*>.*</tr>, the cursor is positioned at the end of the test string. The tr> is therefore matched against the end of the test string, which succeeds. The positive lookbehind evaluates to TRUE, so the overall regex matches.

The ninth example shows how to insert a negative lookbehind assertion, using ?<!. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what follows ?<! in the parens", which in this case is tr>. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. The pattern tr> does match the end of the string, but this is a negative assertion, so it evaluates to FALSE, which means the 9th example is NOT a match.

The tenth example uses another negative lookbehind assertion. Basically it says "allow the regex to match if what's right behind the cursor at this point does not match what's in the parens", in this case Albatross. The bit of regex preceding the assertion, ^<tr.*>.*</tr>, matches up to and including the end of the string. Checking "Albatross" against the end of the string yields a negative match, because the test string ends in </tr>. Because the pattern inside the parens of the negative lookbehind does NOT match, the negative lookbehind evaluates to TRUE, which means the 10th example is a match.

The 11th example extends the negative lookbehind to include wildcards; in English, the negative lookbehind says "only match if the preceding string does not include the word Albatross". In this case the test string DOES include the word, so the negative lookbehind evaluates to FALSE and the 11th regex does not match.

The 12th example uses a negative lookahead assertion. Like lookbehinds, lookaheads are zero-width: they do not move the cursor within the test string for the purposes of string matching. The lookahead in this case rejects the string right away, because .*Albatross.* matches; because it is a negative lookahead, it evaluates to FALSE, which means the overall regex fails to match and evaluation of the regex against the test string stops there.

Example 12 always evaluates to the same boolean value as example 11, but it behaves differently at runtime. In example 12, the negative check is performed first and evaluation stops immediately. In example 11, the rest of the regex is applied in full, and succeeds, before the lookbehind assertion is checked. So you can see that there may be performance differences when comparing lookaheads and lookbehinds. Which one is right for you depends on what you are matching on, and the relative complexity of the "positive match" pattern and the "negative match" pattern.
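If you want to reproduce the twelve results yourself, a small harness along these lines should do it. This is a sketch only: `LookaroundDemo`, `Matches` and `Run` are made-up names, and pattern 8 is written with the (?<=...) positive lookbehind form. Note that patterns 11 and 12 rely on .NET allowing wildcards inside a lookbehind, which many other regex engines do not.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaroundDemo
{
    public const string Sample = "<tr valign='top'><td>Albatross</td></tr>";

    // The twelve example patterns, in order.
    public static readonly string[] Patterns =
    {
        @"tr",
        @"td",
        @"^td",
        @"^tr",
        @"^<tr",
        @"^<tr>.*</tr>",
        @"^<tr.*>.*</tr>",
        @"^<tr.*>.*</tr>(?<=tr>)",            // positive lookbehind
        @"^<tr.*>.*</tr>(?<!tr>)",            // negative lookbehind
        @"^<tr.*>.*</tr>(?<!Albatross)",      // negative lookbehind
        @"^<tr.*>.*</tr>(?<!.*Albatross.*)",  // negative lookbehind with wildcards
        @"^(?!.*Albatross.*)<tr.*>.*</tr>"    // negative lookahead
    };

    // Tests one pattern against the sample row.
    public static bool Matches(string pattern)
    {
        return Regex.IsMatch(Sample, pattern);
    }

    // Prints "N. pattern - match" or "N. pattern - no match" for each pattern.
    public static void Run()
    {
        for (int i = 0; i < Patterns.Length; i++)
        {
            Console.WriteLine("{0,2}. {1} - {2}",
                i + 1, Patterns[i], Matches(Patterns[i]) ? "match" : "no match");
        }
    }
}
```

Calling `LookaroundDemo.Run()` from a console app's Main prints one line per pattern, which makes it easy to add your own variations and see how each assertion changes the outcome.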

For more on this stuff, read up at http://www.regular-expressions.info/

Or get a regex evaluator tool and try out some tests.


answered Oct 12 '22 12:10 by Cheeso


You can easily accomplish your objectives using a single regex. Here is an example which demonstrates one way to do it. This regex matches a string that contains "cat" AND "lion" AND "tiger" but does NOT contain "dog" OR "wolf" OR "hyena":

if (Regex.IsMatch(text, @"
    # Match string containing all of one set of words but none of another.
    ^                # anchor to start of string.
    # Positive look ahead assertions for required substrings.
    (?=.*?  cat   )  # Assert string has: 'cat'.
    (?=.*?  lion  )  # Assert string has: 'lion'.
    (?=.*?  tiger )  # Assert string has: 'tiger'.
    # Negative look ahead assertions for not-allowed substrings.
    (?!.*?  dog   )  # Assert string does not have: 'dog'.
    (?!.*?  wolf  )  # Assert string does not have: 'wolf'.
    (?!.*?  hyena )  # Assert string does not have: 'hyena'.
    ",
    RegexOptions.Singleline | RegexOptions.IgnoreCase |
    RegexOptions.IgnorePatternWhitespace)) {
    // Successful match
} else {
    // Match attempt failed
} 

You can see the needed pattern. When assembling the regex, be sure to run each of the user-provided substrings through the Regex.Escape() method to escape any metacharacters it may contain (e.g. (, ), |). Also, the above regex is written in free-spacing mode for readability. Your production regex should NOT use this mode; otherwise unescaped whitespace within the user substrings would be ignored.

You may want to add \b word boundaries before and after each "word" in each assertion if the substrings consist of only real words.

Note also that the negative assertion can be made a bit more efficient using the following alternative syntax:

(?!.*?(?:dog|wolf|hyena))
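Putting those notes together, a builder for this kind of pattern might look roughly like the following. It is a sketch under assumptions: `WordAssertions` and `Build` are hypothetical names, the inputs are treated as literal substrings (hence Regex.Escape), free-spacing mode is left off, and the forbidden words are folded into the single alternation shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class WordAssertions
{
    // Builds a regex that matches text containing every required substring
    // and none of the forbidden ones. Inputs are literals, not patterns.
    public static Regex Build(IEnumerable<string> required, IEnumerable<string> forbidden)
    {
        StringBuilder pattern = new StringBuilder("^");

        // One positive lookahead per required substring.
        foreach (string word in required)
        {
            pattern.Append("(?=.*?").Append(Regex.Escape(word)).Append(")");
        }

        // A single negative lookahead with all forbidden substrings as alternatives.
        List<string> escaped = new List<string>();
        foreach (string word in forbidden)
        {
            escaped.Add(Regex.Escape(word));
        }
        if (escaped.Count > 0)
        {
            pattern.Append("(?!.*?(?:")
                   .Append(string.Join("|", escaped.ToArray()))
                   .Append("))");
        }

        // Singleline lets '.' cross line breaks; IgnorePatternWhitespace is deliberately omitted.
        return new Regex(pattern.ToString(), RegexOptions.Singleline | RegexOptions.IgnoreCase);
    }
}
```

For example, `WordAssertions.Build(new[] { "cat", "lion", "tiger" }, new[] { "dog", "wolf", "hyena" })` yields a regex equivalent in intent to the one above. Wrapping each escaped word in \b would be the place to add the word-boundary refinement.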

answered Oct 12 '22 13:10 by ridgerunner