
RegEx to match a pattern, as long as it is not preceded by a different pattern

Tags: c#, .net, regex

I need a regex that is to be used for text substitution. Example: text to be matched is ABC (which could be surrounded by square brackets), substitution text is DEF. This is basic enough. The complication is that I don't want to match the ABC text when it is preceded by the pattern \[[\d ]+\]\. - in other words, when it is preceded by a word or set of words in brackets, followed by a period.

Here are some examples of source text to be matched, and the result, after the regex substitution would be made:

1. [xxx xxx].[ABC] > [xxx xxx].[ABC] (does not match - first part fits the pattern)
2. [xxx xxx].ABC   > [xxx xxx].ABC   (does not match - first part fits the pattern)
3. [xxx.ABC        > [xxx.DEF        (matches - first part has no closing bracket)
4. [ABC]           > [DEF]           (matches - no first part)
5. ABC             > DEF             (matches - no first part)
6. [xxx][ABC]      > [xxx][DEF]      (matches - no period in between)
7. [xxx]. [ABC]    > [xxx]. [DEF]    (matches - space in between)

What it comes down to is: how can I specify the preceding pattern that when present as described will prevent a match? What would the pattern be in this case? (C# flavor of regex)

Asked by Yaakov Ellis, Nov 02 '10 19:11



1 Answer

You want a negative look-behind expression. These look like (?<!pattern), so:

(?<!\[[\d ]+\]\.)\[?ABC\]?

Note that this does not force a matching pair of square brackets around ABC; it just allows for an optional open bracket before and an optional close bracket after. If you wanted to force a matching pair or none, you'd have to use alternation:

(?<!\[[\d ]+\]\.)(?:ABC|\[ABC\])

This uses non-capturing parentheses to delimit the alternation. If you want to actually capture ABC, you can of course turn that into a capture group.

ETA: The reason the first expression seems to fail is that it matches ABC], which is not preceded by the prohibited text. Because the open bracket [ is optional, the engine can skip it and start the match at ABC itself; at that position the text just before the match ends in [ rather than in a period, so the look-behind does not reject it. The way around this is to move the optional open bracket [ into the negative look-behind assertion, like so:

(?<!\[[\d ]+\]\.\[?)ABC\]?

An example of what it matches and doesn't:

[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
    matched: ABC
ABC: match (expected: match)
    matched: ABC
[ABC]: match (expected: match)
    matched: ABC]
[ABC[: match (expected: fail)
    matched: ABC
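
Since the original goal is substitution, note that a replacement built on this pattern would also swallow the optional closing bracket, because \]? is part of the match. A minimal sketch of one way around that, assuming it is acceptable to drop the trailing \]? so that only ABC itself is replaced (the class and method names are illustrative, not from the question):

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceSketch {
  // Look-behind from the answer, with the optional "[" inside it.
  // The trailing "\]?" is left out so the replacement keeps any "]".
  const string Pattern = @"(?<!\[[\d ]+\]\.\[?)ABC";

  public static string Substitute(string text) {
    return Regex.Replace(text, Pattern, "DEF");
  }

  public static void Main() {
    string[] inputs = {
      "[123].[ABC]", "[123].ABC", "[123.ABC", "[ABC]",
      "ABC", "[123][ABC]", "[123]. [ABC]"
    };
    foreach (string s in inputs)
      Console.WriteLine("{0} > {1}", s, Substitute(s));
  }
}
```

With digits standing in for the xxx placeholders of the question, the first two inputs should come back unchanged and the remaining five should have ABC replaced by DEF with their brackets intact.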

Trying to make the presence of an open bracket [ force a matching close bracket ], as the second pattern intended, is trickier, but this seems to work:

(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))

An example of what it matches and doesn't:

[123].[ABC]: fail (expected: fail)
[123 456].[ABC]: fail (expected: fail)
[123.ABC: match (expected: match)
    matched: ABC
ABC: match (expected: match)
    matched: ABC
[ABC]: match (expected: match)
    matched: ABC]
[ABC[: fail (expected: fail)
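
For substitution with the balanced-bracket pattern, one option is to leave the pattern as it is and pass a MatchEvaluator to Regex.Replace that rewrites only the ABC portion of each match, so a closing bracket that was matched survives. A sketch under that assumption; BalancedReplace and Substitute are hypothetical names:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedReplace {
  // The balanced-bracket pattern from above, unchanged.
  const string Pattern =
    @"(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))";

  public static string Substitute(string text) {
    // The evaluator swaps ABC for DEF inside the matched text, so a
    // "]" that was part of the match is kept in the result.
    return Regex.Replace(text, Pattern, m => m.Value.Replace("ABC", "DEF"));
  }

  public static void Main() {
    foreach (string s in new[] { "[123].[ABC]", "[123.ABC", "ABC", "[ABC]", "[ABC[" })
      Console.WriteLine("{0} > {1}", s, Substitute(s));
  }
}
```

Inputs that the pattern rejects, such as [123].[ABC] and [ABC[, are returned unchanged.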

The examples were generated using this code:

// Compile and run with: mcs so_regex.cs && mono so_regex.exe
using System;
using System.Text.RegularExpressions;

public class SORegex {
  public static void Main() {
    string[] values = {"[123].[ABC]", "[123 456].[ABC]", "[123.ABC", "ABC", "[ABC]", "[ABC["};
    string[] expected = {"fail", "fail", "match", "match", "match", "fail"};
    string pattern = @"(?<!\[[\d ]+\]\.\[?)ABC\]?";  // Don't force [ to match ].
    //string pattern = @"(?:(?<!\[[\d ]+\]\.\[)ABC\]|(?<!\[[\d ]+\]\.)(?<!\[)ABC(?!\]))";  // Force balanced brackets.
    Console.WriteLine("pattern: {0}", pattern);
    int i = 0;
    foreach (string text in values) {
      Match m = Regex.Match(text, pattern);
      bool isMatch = m.Success;
      Console.WriteLine("{0}: {1} (expected: {2})", text, isMatch ? "match" : "fail", expected[i++]);
      if (isMatch) Console.WriteLine("\tmatched: {0}", m.Value);
    }
  }
}
Answered by Jeremy W. Sherman, Oct 24 '22 14:10