In .NET's RegEx can I get a Groups collection from a Capture object?

.NET offers a Capture collection in its RegularExpression implementation so you can get all instances of a given repeating group rather than just the last instance of it. That's great, but I have a repeating group with subgroups and I'm trying to get at the subgroups as they are related under the group, and can't find a way. Any suggestions?

I've looked at a number of other questions, e.g.:

  • Select multiple elements in a regular expression
  • Regex .NET attached named group
  • How can I get the Regex Groups for a given Capture?

but I have found no applicable answer either affirmative ("Yep, here's how") or negative ("Nope, can't be done.").

For a contrived example say I have an input string:

abc d x 1 2 x 3 x 5 6 e fgh

where the "abc" and "fgh" represent text that I want to ignore in the larger document, "d" and "e" wrap the area of interest, and within that area of interest, "x n [n]" can repeat any number of times. It's those number pairs in the "x" areas that I'm interested in.

So I'm parsing it using this regular expression pattern:

.*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*

which will find exactly one match in the document, but capture the "x" group many times. Here are the three pairs I would want to extract in this example:

  • 1, 2
  • 3
  • 5, 6

but how can I get them? I could do the following (in C#):

using System;
using System.Text.RegularExpressions;
using System.Windows.Forms; // MessageBox lives here

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
foreach (var x in Regex.Match(input, pattern).Groups["x"].Captures) {
    MessageBox.Show(x.ToString());
}

and since I'm referencing group "x" I get these strings:

  • x 1 2
  • x 3
  • x 5 6

But that doesn't get me at the numbers themselves. So I could do "fir" and "sec" independently instead of just "x":

using System;
using System.Text.RegularExpressions;
using System.Windows.Forms; // MessageBox lives here

string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
Match m = Regex.Match(input, pattern);
foreach (var f in m.Groups["fir"].Captures) {
    MessageBox.Show(f.ToString());
}

foreach (var s in m.Groups["sec"].Captures) {
    MessageBox.Show(s.ToString());
}

to get:

  • 1
  • 3
  • 5
  • 2
  • 6

but then I have no way of knowing that it's the second pair that's missing the "4", and not one of the other pairs.

So what to do? I know I could easily parse this out in C# or even with a second regex test on the "x" group, but since the first RegEx run has already done all the work and the results ARE known, it seems there ought to be a way to manipulate the Match object to get what I need out of it.

And remember, this is a contrived example; the real-world case is somewhat more complex, so just throwing extra C# code at it would be a pain. But if the existing .NET objects can't do it, then I just need to know that and I'll continue on my way.

Thoughts?

bob asked Dec 17 '12 17:12


2 Answers

I am not aware of a fully built-in solution and could not find one after a quick search, but that does not rule out the possibility that one exists.

My best suggestion is to use the Index and Length properties to work out which inner captures fall inside each outer capture. It is not very elegant, but you might end up with quite nice code after writing some extension methods.

using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "abc d x 1 2 x 3 x 5 6 e fgh";

var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";

var match = Regex.Match(input, pattern);

// All captures of the repeating group and of its two subgroups.
var xs = match.Groups["x"].Captures.Cast<Capture>();

var firs = match.Groups["fir"].Captures.Cast<Capture>();
var secs = match.Groups["sec"].Captures.Cast<Capture>();

// True when the inner capture starts within the span of the outer capture.
Func<Capture, Capture, Boolean> test = (inner, outer) =>
    (inner.Index >= outer.Index) &&
    (inner.Index < outer.Index + outer.Length);

// One entry per "x" capture; Fir or Sec is null when that subgroup is absent.
var result = xs.Select(x => new
                            {
                                Fir = firs.FirstOrDefault(f => test(f, x)),
                                Sec = secs.FirstOrDefault(s => test(s, x))
                            })
               .ToList();

Here is one possible solution using the following extension method.

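using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Returns the captures of the named group that start inside the given capture.
internal static class Extensions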
{
    internal static IEnumerable<Capture> GetCapturesInside(this Match match,
         Capture capture, String groupName)
    {
        var start = capture.Index;
        var end = capture.Index + capture.Length;

        return match.Groups[groupName]
                    .Captures
                    .Cast<Capture>()
                    .Where(inner => (inner.Index >= start) &&
                                    (inner.Index < end));
    }
}

Now you can rewrite the code as follows.

var input = "abc d x 1 2 x 3 x 5 6 e fgh";

var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";

var match = Regex.Match(input, pattern);

foreach (Capture x in match.Groups["x"].Captures)
{
    // fir or sec is null when that subgroup did not take part in this "x" capture.
    var fir = match.GetCapturesInside(x, "fir").SingleOrDefault();
    var sec = match.GetCapturesInside(x, "sec").SingleOrDefault();
}
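
For reference, the Index/Length idea can be put together into one small program. This is only a sketch under some assumptions: the names CapturePairs, Extract and Inside are made up for illustration, the "-" placeholder for a missing number is arbitrary, and it relies on each "x" capture containing at most one "fir" and one "sec" capture, as in the example input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of .NET: pairs up subgroup captures by position.
internal static class CapturePairs
{
    // Returns one (Fir, Sec) tuple per capture of "x"; a missing subgroup is null.
    internal static List<(string Fir, string Sec)> Extract(string input)
    {
        var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
        var match = Regex.Match(input, pattern);

        // Value of the single capture of groupName that starts inside outer, or null.
        string Inside(Capture outer, string groupName) =>
            match.Groups[groupName].Captures
                 .Cast<Capture>()
                 .Where(c => c.Index >= outer.Index &&
                             c.Index < outer.Index + outer.Length)
                 .Select(c => c.Value)
                 .SingleOrDefault();

        return match.Groups["x"].Captures
                    .Cast<Capture>()
                    .Select(x => (Inside(x, "fir"), Inside(x, "sec")))
                    .ToList();
    }

    private static void Main()
    {
        // "-" is an arbitrary marker for a number that was not captured.
        foreach (var (fir, sec) in Extract("abc d x 1 2 x 3 x 5 6 e fgh"))
            Console.WriteLine($"{fir ?? "-"}, {sec ?? "-"}");
    }
}
```

For the sample input this should print "1, 2", then "3, -", then "5, 6", which keeps the information that it is the second pair that lacks its second number.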

Daniel Brückner answered Nov 07 '22 22:11


Will it always be either a pair or a single number? If so, you could use separate capture groups. Of course, you lose the relative order of the items with this method.

using System;
using System.Text.RegularExpressions;

var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var re = new Regex(@"d\s(?<x>x\s((?<pair>\d+\s\d+)|(?<single>\d+))\s)*e");

var m = re.Match(input);
foreach (Capture s in m.Groups["pair"].Captures) 
{
    Console.WriteLine(s.Value);
}
foreach (Capture s in m.Groups["single"].Captures)
{
    Console.WriteLine(s.Value);
}

Output:

1 2
5 6
3

If you need the order, I'd probably go with Blam's suggestion to use a second regular expression.
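
A rough sketch of what the two-pass approach might look like follows. The class name TwoPassParser, the method Parse and both patterns are assumptions made for illustration, not code from the question; like the pattern above, the sketch assumes every "x" is followed by one or two numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the first regex finds the items, the second splits each one.
internal static class TwoPassParser
{
    // First pass: locate the region between "d" and "e" and capture each "x" item.
    private static readonly Regex Outer =
        new Regex(@"d\s(?<x>x(\s\d+){1,2}\s)*e");

    // Second pass: split one captured item into its first and optional second number.
    private static readonly Regex Inner =
        new Regex(@"^x\s(?<fir>\d+)(\s(?<sec>\d+))?\s$");

    // Returns the pairs in document order; Sec is null when only one number is present.
    internal static List<(string Fir, string Sec)> Parse(string input)
    {
        var pairs = new List<(string Fir, string Sec)>();
        foreach (Capture x in Outer.Match(input).Groups["x"].Captures)
        {
            Match item = Inner.Match(x.Value);
            Group sec = item.Groups["sec"];
            pairs.Add((item.Groups["fir"].Value, sec.Success ? sec.Value : null));
        }
        return pairs;
    }

    private static void Main()
    {
        foreach (var (fir, sec) in Parse("abc d x 1 2 x 3 x 5 6 e fgh"))
            Console.WriteLine(sec == null ? fir : fir + ", " + sec);
    }
}
```

Because the captures of "x" come back in the order they were matched, the result keeps the order of the items: "1, 2", then "3", then "5, 6" for the sample input.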

Adam Prescott answered Nov 07 '22 20:11