How can I match the first subpattern in C#?

Question

I wrote this pattern to match nested divs:

(<div[^>]*>(?:\g<1>|.)*?</div>)

This works nicely, as you can see in regex101.

However, when I use it in the C# code below:

Regex findDivs = new Regex(@"(<div[^>]*>(?:\g<1>|.)*?<\/div>)", RegexOptions.Singleline);

It throws an exception:

Additional information: 
    parsing "(<div[^>]*>(?:\g<1>|.)*?</div>)" - 
        Unrecognized escape sequence \g.

As you can see, \g doesn't work in C#. How can I match the first subpattern, then?

Wiktor Stribiżew · Accepted Answer

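The .NET regex engine does not support the \g<1> subroutine call, so the pattern cannot recurse into group 1. What you are looking for is balancing groups. Here is an equivalent of your regex for .NET: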

(?sx)<div[^>]*>                   # Opening DIV
    (?>                           # Start of atomic group
        (?:(?!</?div[^>]*>).)+    # (1) Any text other than open/close DIV
        |   <div[^>]*> (?<tag>)   # Add 1 "tag" value to stack if opening DIV found 
        |   </div> (?<-tag>)      # Remove 1 "tag" value from stack when closing DIV tag is found
    )*
    (?(tag)(?!))                  # Check if "tag" stack is not empty (then fail)
</div>

See the regex demo
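
To show how the pattern above might be wired into C#, here is a minimal sketch. The class name DivMatcher and the method name FindTopLevelDivs are made up for illustration and are not part of the original answer; the pattern itself is the one shown above, placed in a verbatim string so that no backslash or quote escaping is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are assumptions made for this example.
public static class DivMatcher
{
    // (?sx) turns on Singleline (dot matches newlines) and
    // IgnorePatternWhitespace (layout and # comments are ignored).
    private static readonly Regex TopLevelDivs = new Regex(@"(?sx)
        <div[^>]*>                        # opening DIV
        (?>
            (?:(?!</?div[^>]*>).)+        # text that is not a DIV tag
          | <div[^>]*> (?<tag>)           # nested opening DIV: push onto 'tag'
          | </div>     (?<-tag>)          # nested closing DIV: pop from 'tag'
        )*
        (?(tag)(?!))                      # fail if a nested DIV is still open
        </div>");

    // Returns the outer text of every balanced DIV block found in the input.
    public static List<string> FindTopLevelDivs(string html)
    {
        var result = new List<string>();
        foreach (Match m in TopLevelDivs.Matches(html))
            result.Add(m.Value);
        return result;
    }
}
```

Because matches do not overlap, a DIV nested inside an already matched DIV is not reported separately. For the input <div id="a">x<div>y</div>z</div><p>skip</p><div>w</div>, this sketch should return two strings: the whole first DIV including its nested child, and <div>w</div>.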

However, you might really want to use HtmlAgilityPack to parse HTML.

The main point is to get an XPath that will match all DIV tags that have no ancestors with the same name. You might want something like this (untested):

// Requires: using System; using System.Collections.Generic; using System.Linq;
private List<string> GetTopmostDivs(string html)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    // Treat the input as a URL only if it is an absolute http or https URI
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select only the DIVs that are not nested inside another DIV
    var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
    // SelectNodes returns null when nothing matches
    if (nodes != null)
        return nodes.Select(p => p.OuterHtml).ToList();
    else
        return new List<string>();
}

Tags: c#, regex

João Ferreira
