.NET offers a Capture collection in its RegularExpression implementation so you can get all instances of a given repeating group rather than just the last instance of it. That's great, but I have a repeating group with subgroups and I'm trying to get at the subgroups as they are related under the group, and can't find a way. Any suggestions?
I've looked at a number of other questions, but I have found no applicable answer, either affirmative ("Yep, here's how") or negative ("Nope, can't be done.").
For a contrived example say I have an input string:
abc d x 1 2 x 3 x 5 6 e fgh
where the "abc" and "fgh" represent text that I want to ignore in the larger document, "d" and "e" wrap the area of interest, and within that area of interest, "x n [n]" can repeat any number of times. It's those number pairs in the "x" areas that I'm interested in.
So I'm parsing it using this regular expression pattern:
.*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*
which will find exactly one match in the document, but capture the "x" group many times. Here are the three pairs I would want to extract in this example:
1 2
3
5 6
but how can I get them? I could do the following (in C#):
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms; // needed for MessageBox
string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
foreach (Capture x in Regex.Match(input, pattern).Groups["x"].Captures) {
    MessageBox.Show(x.ToString());
}
and since I'm referencing group "x" I get these strings:
x 1 2
x 3
x 5 6
But that doesn't get me at the numbers themselves. So I could do "fir" and "sec" independently instead of just "x":
using System;
using System.Text;
using System.Text.RegularExpressions;
using System.Windows.Forms; // needed for MessageBox
string input = "abc d x 1 2 x 3 x 5 6 e fgh";
string pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
Match m = Regex.Match(input, pattern);
foreach (Capture f in m.Groups["fir"].Captures) {
    MessageBox.Show(f.ToString());
}
foreach (Capture s in m.Groups["sec"].Captures) {
    MessageBox.Show(s.ToString());
}
to get:
1
3
5
2
6
but then I have no way of knowing that it's the second pair that's missing the "4", and not one of the other pairs.
So what to do? I know I could easily parse this out in C# or even with a second regex test on the "x" group, but since the first RegEx run has already done all the work and the results ARE known, it seems there ought to be a way to manipulate the Match object to get what I need out of it.
And remember, this is a contrived example, the real world case is somewhat more complex so just throwing extra C# code at it would be a pain. But if the existing .NET objects can't do it, then I just need to know that and I'll continue on my way.
Thoughts?
I am not aware of a fully built-in solution and could not find one after a quick search, but that does not exclude the possibility that one exists.
My best suggestion is to use the Index and Length properties to find the captures that fall inside each "x" capture. It is not really elegant, but you might be able to come up with some quite nice code after writing a few extension methods.
using System;
using System.Linq;
using System.Text.RegularExpressions;
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
var match = Regex.Match(input, pattern);
var xs = match.Groups["x"].Captures.Cast<Capture>();
var firs = match.Groups["fir"].Captures.Cast<Capture>();
var secs = match.Groups["sec"].Captures.Cast<Capture>();
// True when inner starts within the span covered by outer.
Func<Capture, Capture, Boolean> test = (inner, outer) =>
    (inner.Index >= outer.Index) &&
    (inner.Index < outer.Index + outer.Length);
// For each "x" capture, pick the nested "fir"/"sec" capture (null if absent).
var result = xs.Select(x => new
    {
        Fir = firs.FirstOrDefault(f => test(f, x)),
        Sec = secs.FirstOrDefault(s => test(s, x))
    })
    .ToList();
Here is one possible solution using the following extension method.
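using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
internal static class Extensions
{
    // Returns the captures of groupName that start inside the given capture.
    internal static IEnumerable<Capture> GetCapturesInside(this Match match,
        Capture capture, String groupName)
    {
        var start = capture.Index;
        var end = capture.Index + capture.Length;
        return match.Groups[groupName]
                    .Captures
                    .Cast<Capture>()
                    .Where(inner => (inner.Index >= start) &&
                                    (inner.Index < end));
    }
}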
Now you can rewrite the code as follows.
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var pattern = @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";
var match = Regex.Match(input, pattern);
foreach (Capture x in match.Groups["x"].Captures)
{
    var fir = match.GetCapturesInside(x, "fir").SingleOrDefault();
    var sec = match.GetCapturesInside(x, "sec").SingleOrDefault();
}
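For completeness, here is a minimal sketch of how the Index/Length idea could be wrapped so that it returns the values as ordered pairs, with null standing in for a missing number. The class name NestedCaptures and the methods ExtractPairs and ValueInside are hypothetical names chosen for illustration and are not part of .NET; the pattern is the one from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of .NET): pairs each "x" capture with the
// "fir"/"sec" captures whose start index falls inside it.
public static class NestedCaptures
{
    private const string Pattern =
        @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*";

    public static List<(string Fir, string Sec)> ExtractPairs(string input)
    {
        var pairs = new List<(string Fir, string Sec)>();
        var match = Regex.Match(input, Pattern);
        if (!match.Success)
        {
            return pairs; // no area of interest found
        }

        // Captures are stored in match order, so the pairs keep their order.
        foreach (Capture x in match.Groups["x"].Captures)
        {
            pairs.Add((ValueInside(match, x, "fir"),
                       ValueInside(match, x, "sec")));
        }
        return pairs;
    }

    // Value of the first capture of groupName that starts inside outer,
    // or null when that group did not participate in this repetition.
    private static string ValueInside(Match match, Capture outer, string groupName)
    {
        return match.Groups[groupName].Captures
            .Cast<Capture>()
            .Where(c => c.Index >= outer.Index &&
                        c.Index < outer.Index + outer.Length)
            .Select(c => c.Value)
            .FirstOrDefault();
    }
}
```

Calling NestedCaptures.ExtractPairs("abc d x 1 2 x 3 x 5 6 e fgh") should give three entries: ("1", "2"), ("3", null) and ("5", "6"). The lookup rescans the "fir" and "sec" captures for every "x", which is fine for short inputs but is quadratic in the number of repetitions.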
Will it always be either a pair or a single number? If so, you could use separate capture groups. Of course, you lose the order of the items with this method.
var input = "abc d x 1 2 x 3 x 5 6 e fgh";
var re = new Regex(@"d\s(?<x>x\s((?<pair>\d+\s\d+)|(?<single>\d+))\s)*e");
var m = re.Match(input);
foreach (Capture s in m.Groups["pair"].Captures)
{
    Console.WriteLine(s.Value);
}
foreach (Capture s in m.Groups["single"].Captures)
{
    Console.WriteLine(s.Value);
}
1 2
5 6
3
If you need the order, I'd probably go with Blam's suggestion to use a second regular expression.
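As a rough sketch of what the second-regex route might look like: the outer pattern from the question locates each "x" capture, and a small inner pattern is then applied to the text of each capture. The names SecondPassParser, Parse and the inner pattern are assumptions made for this illustration and do not come from the original answers.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical two-pass parser: the first regex finds the repeated "x" chunks,
// the second one splits each chunk into its optional first and second number.
public static class SecondPassParser
{
    // Outer pattern taken from the question.
    private static readonly Regex Outer = new Regex(
        @".*d (?<x>x ((?<fir>\d+) )?((?<sec>\d+) )?)*?e.*");

    // Inner pattern (an assumption): matches one whole chunk such as "x 1 2 ".
    private static readonly Regex Inner = new Regex(
        @"^x (?:(?<fir>\d+) )?(?:(?<sec>\d+) )?$");

    public static List<(string Fir, string Sec)> Parse(string input)
    {
        var result = new List<(string Fir, string Sec)>();
        var outer = Outer.Match(input);
        if (!outer.Success)
        {
            return result;
        }

        foreach (Capture x in outer.Groups["x"].Captures)
        {
            var inner = Inner.Match(x.Value);
            if (!inner.Success)
            {
                continue; // chunk did not have the expected shape
            }
            // Group.Success is false when an optional group did not participate.
            var fir = inner.Groups["fir"].Success ? inner.Groups["fir"].Value : null;
            var sec = inner.Groups["sec"].Success ? inner.Groups["sec"].Value : null;
            result.Add((fir, sec));
        }
        return result;
    }
}
```

Because the chunks are processed in capture order, the pairs come back in document order, so for the sample input the middle entry is the one with a null second value. The cost is that each chunk is matched twice, once by each pattern.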