
.NET regex captures not in expected order

Tags: .net, regex

In .NET, Regex does not organize captures the way I would expect. (I won't call this a bug, since it is obviously intentional. However, it's not how I'd expect it to work, nor do I find it helpful.)

This regex is for recipe ingredients (simplified for the sake of example):

(?<measurement>           # begin group
  \s*                     # optional beginning space or group separator
  (
    (?<integer>\d+)|      # integer
    (
      (?<numtor>\d+)      # numerator
      /
      (?<dentor>[1-9]\d*) # denominator. 0 not allowed
    )
  )
  \s(?<unit>[a-zA-Z]+)
)+                        # end group. can have multiple

My string: 3 tbsp 1/2 tsp

Resulting groups and captures:

[measurement][0]=3 tbsp
[measurement][1]= 1/2 tsp
[integer][0]=3
[numtor][0]=1
[dentor][0]=2
[unit][0]=tbsp
[unit][1]=tsp

Notice how, even though 1/2 tsp is in the 2nd Capture, its parts are in [0], since those slots were previously unused.
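For anyone who wants to reproduce this, here is a minimal sketch that dumps every capture of every named group. The class and method names (CaptureDemo, Describe) are made up for illustration, and it assumes the pattern is compiled with RegexOptions.IgnorePatternWhitespace, which the question implies but does not show.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureDemo
{
    // The pattern from the question; inline comments dropped for brevity.
    private const string Pattern = @"
        (?<measurement>
          \s*
          (
            (?<integer>\d+)|
            (
              (?<numtor>\d+)
              /
              (?<dentor>[1-9]\d*)
            )
          )
          \s(?<unit>[a-zA-Z]+)
        )+";

    private static readonly string[] GroupNames =
        { "measurement", "integer", "numtor", "dentor", "unit" };

    // Hypothetical helper: one line per capture, in the "[group][index]=value" form used above.
    public static List<string> Describe(string input)
    {
        var lines = new List<string>();
        Match m = Regex.Match(input, Pattern, RegexOptions.IgnorePatternWhitespace);
        foreach (string name in GroupNames)
        {
            // Each group keeps its own capture list, indexed independently of the others.
            CaptureCollection captures = m.Groups[name].Captures;
            for (int i = 0; i < captures.Count; i++)
            {
                lines.Add($"[{name}][{i}]={captures[i].Value}");
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Describe("3 tbsp 1/2 tsp"))
        {
            Console.WriteLine(line);
        }
    }
}
```

Run against 3 tbsp 1/2 tsp, this should reproduce the listing above: the numtor and dentor captures sit at index 0 even though they belong to the second measurement.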

Is there any way to get all of the parts to have predictable useful indexes without having to re-run each group through the regex again?

asked Nov 14 '22 08:11 by Dinah


2 Answers

Is there any way to get all of the parts to have predictable useful indexes without having to re-run each group through the regex again?

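Not with Captures. Each group keeps its own capture list, and nothing in those lists ties a numtor capture to the measurement capture it came from. And if you're going to perform multiple matches anyway, I suggest you remove the + and match each component of the measurement separately, like so: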

  string s = @"3 tbsp 1/2 tsp";

  Regex r = new Regex(@"\G\s* # anchor to end of previous match
    (?<measurement>           # begin group
      (
        (?<integer>\d+)       # integer
      |
        (
          (?<numtor>\d+)      # numerator
          /
          (?<dentor>[1-9]\d*) # denominator. 0 not allowed
        )
      )
      \s+(?<unit>[a-zA-Z]+)
    )                         # end group.
  ", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

  foreach (Match m in r.Matches(s))
  {
    for (int i = 1; i < m.Groups.Count; i++)
    {
      Group g = m.Groups[i];
      if (g.Success)
      {
        Console.WriteLine("[{0}] = {1}", r.GroupNameFromNumber(i), g.Value);
      }
    }
    Console.WriteLine("");
  }

Output:

[measurement] = 3 tbsp
[integer] = 3
[unit] = tbsp

[measurement] = 1/2 tsp
[numtor] = 1
[dentor] = 2
[unit] = tsp

The \G at the beginning ensures that each match starts exactly where the previous match ended (or at the beginning of the input if this is the first match attempt). You can also save the match-end position between calls, then use the two-argument Matches(input, startat) instance method to resume parsing at that point; \G then anchors at startat instead of at the beginning of the input.
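Saving the position between calls might look roughly like the sketch below. It uses Regex.Match(input, startat), the single-match counterpart of the Matches overload mentioned above; ResumeDemo and TryReadNext are hypothetical names, and the pattern is a condensed form of the one in this answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ResumeDemo
{
    // \G pins each attempt to the start position passed to Match.
    private static readonly Regex OneMeasurement = new Regex(@"\G\s*
        (?<measurement>
          ( (?<integer>\d+) | (?<numtor>\d+) / (?<dentor>[1-9]\d*) )
          \s+(?<unit>[a-zA-Z]+)
        )", RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    // Hypothetical helper: reads one measurement starting at 'position'
    // and advances 'position' to the end of that match.
    public static bool TryReadNext(string input, ref int position, out string measurement)
    {
        Match m = OneMeasurement.Match(input, position);
        if (!m.Success)
        {
            measurement = "";
            return false;
        }
        position = m.Index + m.Length; // saved match-end position for the next call
        measurement = m.Groups["measurement"].Value;
        return true;
    }

    public static void Main()
    {
        string s = "3 tbsp 1/2 tsp";
        int position = 0;
        while (TryReadNext(s, ref position, out string measurement))
        {
            Console.WriteLine("{0} (next position {1})", measurement, position);
        }
    }
}
```

Because the caller owns the position, other work can happen between calls, or a different regex can take over at the same offset.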

answered Dec 17 '22 08:12 by Alan Moore


Seems like you probably need to loop through the input, matching one measurement at a time. Then you would have predictable access to the parts of that measurement, during the loop iteration for that measurement.
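As a rough sketch of that loop, the code below matches one measurement per iteration with Match and NextMatch and turns each one into a (quantity, unit) pair while its parts are still at hand. The names MeasurementLoop and Parse, and the choice of decimal for the quantity, are illustrative assumptions and not anything from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class MeasurementLoop
{
    // One measurement per match; \G keeps consecutive matches adjacent.
    private static readonly Regex OneMeasurement = new Regex(@"\G\s*
        ( (?<integer>\d+) | (?<numtor>\d+) / (?<dentor>[1-9]\d*) )
        \s+(?<unit>[a-zA-Z]+)",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture);

    // Hypothetical helper: returns each measurement as a (quantity, unit) pair.
    public static List<(decimal Quantity, string Unit)> Parse(string input)
    {
        var result = new List<(decimal Quantity, string Unit)>();
        for (Match m = OneMeasurement.Match(input); m.Success; m = m.NextMatch())
        {
            // Within a single match, each group has at most one capture,
            // so Success says which alternative was taken.
            decimal quantity = m.Groups["integer"].Success
                ? decimal.Parse(m.Groups["integer"].Value, CultureInfo.InvariantCulture)
                : decimal.Parse(m.Groups["numtor"].Value, CultureInfo.InvariantCulture)
                  / decimal.Parse(m.Groups["dentor"].Value, CultureInfo.InvariantCulture);
            result.Add((quantity, m.Groups["unit"].Value));
        }
        return result;
    }

    public static void Main()
    {
        foreach (var item in Parse("3 tbsp 1/2 tsp"))
        {
            Console.WriteLine("{0} {1}", item.Quantity, item.Unit);
        }
    }
}
```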

answered Dec 17 '22 07:12 by LarsH