I have a regular expression. It contains a required named capture group, and some optional named capture groups. It captures individual matches and parses the sections into the named groups which I need.
Except, now I need it to repeat.
Essentially, my regular expression represents a single atomic unit in a (potentially) much longer string. Instead of matching my regex exactly, the target string will usually contain repeated instances of the pattern, separated by the dot '.' character.
For example, if this is what my regular expression captures: <some match>
The actual string could look like any of these:
<some match>
<some match>.<some other match>
<some match>.<some other match>.<yet another match>
What is the simplest way in which to modify the original regular expression, to account for the repeating patterns, while ignoring the dots?
I'm not sure if it's actually needed, but here is the regular expression I'm using to capture individual segments. Again, I'd like to enhance it to account for optional additional segments, and I'd like each segment to appear as another "match" in the result set:
^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$
It is intended to parse a class path, with up to three optional index accessors per segment (e.g. "member.sub_member[0].sub_sub_member[0][1][2]").
I suspect the answer involves look-ahead or look-behind, with which I am not entirely familiar.
I currently use String.Split to separate the string into segments. But I figure that if the enhancement to the regex is simple enough, I could skip the extra Split step and re-use the regex as a validation mechanism as well.
EDIT:
As an additional wrench in the gears, I'd like to disallow the dot '.' character at the beginning or end of the string. Dots should only exist as separators between path segments.
You don't really need to use any look-arounds. You can put a (?:^|\.) in front of your main pattern, wrap the whole thing in a group, and then put a + after it. That will allow you to match a repeating, dot-separated sequence. I would also recommend you combine your <index> groups into a single capture group for simplicity (I used * to match any number of indexes, but you can just as easily use {0,3} to match at most three). The final pattern would be:
(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$
For example:
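using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "member.sub_member[0].sub_sub_member[0][1][2]";
var pattern = @"(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$";
var match = Regex.Match(input, pattern);

// Flatten every capture of every group, ordered by position in the input.
// Skip(1) drops the capture of group 0, which is the whole match.
var parts =
    (from Group g in match.Groups
     from Capture c in g.Captures
     orderby c.Index
     select c.Value)
    .Skip(1);

foreach (var part in parts)
{
    Console.WriteLine(part);
}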
Which will output:
member
sub_member
0
sub_sub_member
0
1
2
Update: This pattern will ensure that the string cannot have any leading or trailing dots. It's a monster, but it should work:
^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$
Or this one, although I did have to give up on my 'no-look-arounds' idea:
^(?!\.)(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})+$
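To see how the first of these two patterns behaves as a validator, here is a minimal sketch that runs it against a few inputs. The helper name IsValidPath and the sample strings are made up for illustration; only the pattern itself comes from the answer above.

```csharp
using System;
using System.Text.RegularExpressions;

// Pattern copied from the answer; .NET allows the same group name to appear twice.
const string PathPattern =
    @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}" +
    @"(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$";

// Hypothetical helper, not part of the original answer.
bool IsValidPath(string path) => Regex.IsMatch(path, PathPattern);

Console.WriteLine(IsValidPath("member.sub_member[0].sub_sub_member[0][1][2]")); // True
Console.WriteLine(IsValidPath(".member"));             // False: leading dot
Console.WriteLine(IsValidPath("member."));             // False: trailing dot
Console.WriteLine(IsValidPath("member..sub_member"));  // False: empty segment
Console.WriteLine(IsValidPath("member[0][1][2][3]"));  // False: more than three indexes
```

Because every dot in the pattern must be followed by a full member name, a leading, trailing, or doubled dot can never match.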
The easiest way is likely to split the string on the '.' character using string.Split, and then apply your regular expression to each element in the resulting array. A regex that long would have some brutal performance and potential lookahead/lookbehind problems anyway.
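A rough sketch of that split-then-match approach might look like the following. The helper name TryParsePath and its return shape are assumptions made for illustration, and the single-segment pattern is the one from the question with its three index groups folded into one repeated group.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: splits on '.', validates each segment, and collects
// the member name plus its index accessors (kept as strings).
static bool TryParsePath(string path, out List<(string Member, string[] Indexes)> segments)
{
    const string segmentPattern =
        @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}$";
    segments = new List<(string Member, string[] Indexes)>();

    foreach (var segment in path.Split('.'))
    {
        var m = Regex.Match(segment, segmentPattern);
        // A leading, trailing, or doubled dot yields an empty segment,
        // which fails here because a member name is required.
        if (!m.Success) return false;

        var indexes = m.Groups["index"].Captures
            .Cast<Capture>()
            .Select(c => c.Value)
            .ToArray();
        segments.Add((m.Groups["member"].Value, indexes));
    }
    return true;
}

if (TryParsePath("member.sub_member[0].sub_sub_member[0][1][2]", out var parsed))
{
    foreach (var (member, indexes) in parsed)
        Console.WriteLine($"{member}: [{string.Join(", ", indexes)}]");
}
```

Since each segment is validated on its own, this also covers the requirement from the question's edit: a dot at either end of the string produces an empty segment and the whole path is rejected.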