Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I specify the priority of a match pattern in a Regex?

I'm writing a function-parsing engine which uses regular expressions to separate the individual terms (defined as a constant or a variable followed (optionally) by an operator). It's working great, except when I have grouped terms within other grouped terms. Here's the code I'm using:

//This matches an opening delimiter
Regex openers = new Regex("[\\[\\{\\(]");

//This matches a closing delimiter
Regex closers = new Regex("[\\]\\}\\)]");

//This matches the name of a variable (\w+) or a constant numeric value (\d+(\.\d+)?)
Regex VariableOrConstant = new Regex("((\\d+(\\.\\d+)?)|\\w+)" + FunctionTerm.opRegex + "?");

//This matches the binary operators +, *, -, or /
Regex ops = new Regex("[\\*\\+\\-/]");

//This compound Regex finds a single variable or constant term (including a proceeding operator,
//if any) OR a group containing multiple terms (and their proceeding operators, if any)
//and a proceeding operator, if any.
//Matches that match this second pattern need to be added to the function as sub-functions,
//not as individual terms, to ensure the correct evalutation order with parentheses.
Regex splitter = new Regex(
openers + 
"(" + VariableOrConstant + ")+" + closers + ops + "?" +
"|" +
"(" + VariableOrConstant + ")" + ops + "?");

When "splitter" is matched against the string "4/(2*X*[2+1])", the matches' values are "4/", "2*", "X*", "2+", and "1", completely ignoring all of the delimiting parentheses and braces. I believe this is because the second half of the "splitter" Regex (the part after the "|") is being matched and overriding the other part of the pattern. This is bad- I want grouped expressions to take precedence over single terms. Does anyone know how I can do this? I looked into using positive/negative lookaheads and lookbehinds, but I'm honestly not sure how to use those, or what they're even for, for that matter, and I can't find any relevant examples... Thanks in advance.

like image 719
Michael Hoffmann Avatar asked Dec 13 '10 01:12

Michael Hoffmann


People also ask

What is an anchor in regex?

Anchors belong to the family of regex tokens that don't match any characters, but that assert something about the string or the matching process. Anchors assert that the engine's current position in the string matches a well-determined location: for instance, the beginning of the string, or the end of a line.

Does * match everything in regex?

Throw in an * (asterisk), and it will match everything. Read more. \s (whitespace metacharacter) will match any whitespace character (space; tab; line break; ...), and \S (opposite of \s ) will match anything that is not a whitespace character.

How do you match a sequence in regex?

How do you match a character sequence in regex? To match a character having special meaning in regex, you need to use a escape sequence prefix with a backslash ( \ ). E.g., \. matches “.” ; regex \+ matches “+” ; and regex \( matches “(” .


2 Answers

You didn't show us how you're applying the regex, so here's a demo I whipped up:

private static void ParseIt(string subject)
{
  Console.WriteLine("subject : {0}\n", subject);

  Regex openers = new Regex(@"[[{(]");
  Regex closers = new Regex(@"[]})]");
  Regex ops = new Regex(@"[*+/-]");
  Regex VariableOrConstant = new Regex(@"((\d+(\.\d+)?)|\w+)" + ops + "?");

  Regex splitter = new Regex(
    openers + @"(?<FIRST>" + VariableOrConstant + @")+" + closers + ops + @"?" +
    @"|" +
    @"(?<SECOND>" + VariableOrConstant + @")" + ops + @"?",
    RegexOptions.ExplicitCapture
  );

  foreach (Match m in splitter.Matches(subject))
  {
    foreach (string s in splitter.GetGroupNames())
    {
      Console.WriteLine("group {0,-8}: {1}", s, m.Groups[s]);
    }
    Console.WriteLine();
  }
}

output:

subject : 4/(2*X*[2+1])

group 0       : 4/
group FIRST   :
group SECOND  : 4/

group 0       : 2*
group FIRST   :
group SECOND  : 2*

group 0       : X*
group FIRST   :
group SECOND  : X*

group 0       : [2+1]
group FIRST   : 1
group SECOND  :

As you can see, the term [2+1] is matched by the first part of the regex, as you intended. It can't do anything with the (, though, because the next bracketing character after that is another "opener" ([), and it's looking for a "closer".

You could use .NET's "balanced matching" feature to allow for grouped terms enclosed in other groups, but it's not worth the effort. Regexes are not designed for parsing--in fact, parsing and regex matching are fundamentally different kinds of operation. And this is a good example of the difference: a regex actively seeks out matches, skipping over anything it can't use (like the open-parenthesis in your example), but a parser has to examine every character (even if it's just to decide to ignore it).

About the demo: I tried to make the minimum functional changes necessary to get your code to work (which is why I didn't correct the error of putting the + outside the capturing group), but I also made several surface changes, and those represent active recommendations. To wit:

  • Always use verbatim string literals (@"...") when creating regexes in C# (I think the reason is obvious).
  • If you're using capturing groups, use named groups whenever possible, but don't use named groups and numbered groups in the same regex. Named groups save you the hassle of keeping track of what's captured where, and the ExplicitCapture option saves you having to clutter up the regex with (?:...) wherever you need a non-capturing group.

Finally, that whole scheme of building a large regex from a bunch of smaller regexes has very limited usefulness IMO. It's very difficult to keep track of the interactions between the parts, like which part's inside which group. Another advantage of C#'s verbatim strings is that they're multiline, so you can take advantage of free-spacing mode (a.k.a. /x or COMMENTS mode):

  Regex r = new Regex(@"
    (?<GROUPED>
      [[{(]                  # opening bracket
      (                      # group containing:
        ((\d+(\.\d+)?)|\w+)     # number or variable
        [*+/-]?                 # and proceeding operator
      )+                     # ...one or more times
      []})]                  # closing bracket
      [*+/-]?                # and proceeding operator
    )
    |
    (?<UNGROUPED>
      ((\d+(\.\d+)?)|\w+)    # number or variable
      [*+/-]?                # and proceeding operator
    )
    ",
    RegexOptions.ExplicitCapture | RegexOptions.IgnorePatternWhitespace
  );

This is not intended as a solution to your problem; as I said, that's not a job for regexes. This is just a demonstration of some useful regex techniques.

like image 200
Alan Moore Avatar answered Nov 04 '22 03:11

Alan Moore


try using difrent quantifiers

greedy:

*  +  ?

possessive:

*+ ++ ?+

lazy:

*? +? ??

Try reading this and this

also maybe non-capturing group:

(?:your expr here)

try try try! practice make perfect! :)

like image 21
marverix Avatar answered Nov 04 '22 05:11

marverix