
Repeatable, complex regular expression, with dot '.' delimited separators

Tags: c#, regex

I have a regular expression. It contains a required named capture group, and some optional named capture groups. It captures individual matches and parses the sections into the named groups which I need.

Except, now I need it to repeat.

Essentially, my regular expression represents a single atomic unit in a (potentially) much longer string. Instead of matching my regex exactly, the target string will usually contain repeated instances of the regex, separated by the dot '.' character.

For example, if this is what my regular expression captures: <some match>

The actual string could look like any of these:

  • <some match>
  • <some match>.<some other match>
  • <some match>.<some other match>.<yet another match>

What is the simplest way to modify the original regular expression to account for the repeating patterns, while ignoring the dots?

I'm not sure if it's actually needed, but here is the regular expression I'm using to capture individual segments. Again, I'd like to enhance this to account for optional additional segments, with each segment appearing as another "match" in the result set:

^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$

It is intended to parse a class path, with up to three optional index accessors per segment (e.g. "member.sub_member[0].sub_sub_member[0][1][2]").
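For reference, here is a minimal sketch of how the single-segment pattern above behaves on its own. It assumes C# 9 or later (top-level statements), and the variable names are only illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

// The single-segment pattern from the question, unchanged.
var segmentPattern = new Regex(
    @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$");

// One segment with all three optional indexes present.
var single = segmentPattern.Match("sub_sub_member[0][1][2]");
Console.WriteLine(single.Groups["member"].Value);  // sub_sub_member
Console.WriteLine(single.Groups["index"].Value);   // 0
Console.WriteLine(single.Groups["index2"].Value);  // 1
Console.WriteLine(single.Groups["index3"].Value);  // 2

// Optional groups that did not participate report Success == false.
var bare = segmentPattern.Match("member");
Console.WriteLine(bare.Groups["index"].Success);   // False

// A full dotted path is rejected, because the pattern has no place for '.'.
var dottedPathMatches = segmentPattern.IsMatch("member.sub_member[0]");
Console.WriteLine(dottedPathMatches);              // False
```

The last line shows the gap described above: the pattern handles one segment, but not a dot-separated sequence of them.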

I suspect the answer involves look-ahead or look-behind, with which I am not entirely familiar.

I currently use String.Split to separate string segments. But I figure if the enhancement to the regex is simple enough, I could skip the extra Split step and re-use the regex as a validation mechanism as well.

EDIT:

As an additional wrench in the gears, I'd like to disallow any dot '.' character at the beginning or end of the string. Dots should only exist as separators between path segments.

BTownTKD asked Jul 19 '13 12:07


2 Answers

You don't really need any look-arounds. Put (^|\.) in front of your main pattern, wrap the result in a group, and add a + after it. That gives you a repeating, dot-separated sequence. I would also recommend combining your index groups into a single <index> group for simplicity; in .NET, each repetition is recorded in that group's Captures collection. (I used * to match any number of indexes, but you can just as easily use {0,3} to allow at most 3.) The final pattern would be:

(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$

For example:

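using System;
using System.Linq;
using System.Text.RegularExpressions;

var input = "member.sub_member[0].sub_sub_member[0][1][2]";
var pattern = @"(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])*)+$";
var match = Regex.Match(input, pattern);

// Flatten every capture of every group and sort by position in the input.
// Skip(1) drops group 0 (the whole match), which sorts first.
var parts =
    (from Group g in match.Groups
     from Capture c in g.Captures
     orderby c.Index
     select c.Value)
    .Skip(1);

foreach (var part in parts)
{
    Console.WriteLine(part);
}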

Which will output:

member
sub_member
0
sub_sub_member
0
1
2

Update: This pattern will ensure that the string cannot have any leading or trailing dots. It's a monster, but it should work:

^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$

Or this one, although I did have to give up on my 'no-look-arounds' idea:

^(?!\.)(?:(?:^|\.)(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})+$
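As a rough sketch of how the first "Update" pattern (the one without look-arounds) could be used for both validation and extraction: the helper name DescribeSegments and its output format are made up for illustration, and the code assumes C# 9 or later (top-level statements). Because .NET lets both <member> groups share one name, all captures land in the same group, and each index can be attributed to a member by comparing capture positions.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// First "Update" pattern from the answer, unchanged.
var pathPattern = new Regex(
    @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3}(?:\.(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\]){0,3})*$");

// Hypothetical helper: returns one "name [i,j,...]" string per segment,
// or an empty array when the path does not match the pattern.
string[] DescribeSegments(string path)
{
    var m = pathPattern.Match(path);
    if (!m.Success) return Array.Empty<string>();

    var members = m.Groups["member"].Captures.Cast<Capture>().ToList();
    var indexes = m.Groups["index"].Captures.Cast<Capture>().ToList();

    return members.Select((member, i) =>
    {
        // A member owns every index captured before the next member starts.
        int end = i + 1 < members.Count ? members[i + 1].Index : path.Length;
        var own = indexes
            .Where(c => c.Index > member.Index && c.Index < end)
            .Select(c => c.Value);
        return member.Value + " [" + string.Join(",", own) + "]";
    }).ToArray();
}

Console.WriteLine(string.Join(" | ",
    DescribeSegments("member.sub_member[0].sub_sub_member[0][1][2]")));
// member [] | sub_member [0] | sub_sub_member [0,1,2]

Console.WriteLine(DescribeSegments(".member").Length);  // 0 (leading dot rejected)
Console.WriteLine(DescribeSegments("member.").Length);  // 0 (trailing dot rejected)
```

An empty result doubles as the "invalid path" signal here, since a valid path always contains at least one segment.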
p.s.w.g answered Nov 16 '22 08:11


The easiest way is likely to split the string on the '.' character using string.Split, and then apply your regular expression to each element of the resulting array. A regex that long would be harder to maintain, could perform poorly, and may run into lookahead/lookbehind problems anyway.
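A minimal sketch of that approach, reusing the single-segment pattern from the question unchanged. The helper name IsValidPath is hypothetical, and the code assumes C# 9 or later (top-level statements):

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Single-segment pattern from the question, unchanged.
var segmentPattern = new Regex(
    @"^(?<member>[A-Za-z_][A-Za-z0-9_]*)(?:\[(?<index>[0-9]+)\])?(?:\[(?<index2>[0-9]+)\])?(?:\[(?<index3>[0-9]+)\])?$");

// Hypothetical helper: a path is valid when every dot-separated piece matches.
// Leading, trailing, or doubled dots produce empty pieces, which fail the pattern.
bool IsValidPath(string path) =>
    path.Split('.').All(segment => segmentPattern.IsMatch(segment));

Console.WriteLine(IsValidPath("member.sub_member[0].sub_sub_member[0][1][2]")); // True
Console.WriteLine(IsValidPath(".member"));  // False
Console.WriteLine(IsValidPath("member."));  // False

// Each segment can then be parsed individually.
foreach (var segment in "member.sub_member[0]".Split('.'))
{
    var m = segmentPattern.Match(segment);
    Console.WriteLine(m.Groups["member"].Value + " index=" + m.Groups["index"].Value);
}
```

This keeps the regex short, and the empty strings that Split yields for stray dots take care of the leading/trailing-dot requirement without any look-arounds.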

Haney answered Nov 16 '22 10:11