
How can I match the first subpattern in C#?

Tags: c#, regex

I wrote this pattern to match nested divs:

(<div[^>]*>(?:\g<1>|.)*?<\/div>)

This works nicely, as you can see in regex101.

However, when I write the code below in C#:

Regex findDivs = new Regex("(<div[^>]*>(?:\\g<1>|.)*?<\\/div>)", RegexOptions.Singleline);

It throws an error:

Additional information: 
    parsing "(<div[^>]*>(?:\g<1>|.)*?<\/div>)" - 
        Unrecognized escape sequence \g.

As you can see, \g doesn't work in C#. How can I match the first subpattern then?

João Ferreira asked Oct 19 '22 08:10


1 Answer

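What you are looking for is balancing groups. The .NET regex engine does not support subroutine calls such as \g<1>, so the recursion has to be emulated with a named capture stack. Here is an equivalent of your regex for .NET: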

(?sx)<div[^>]*>                   # Opening DIV
    (?>                           # Start of atomic group
        (?:(?!</?div[^>]*>).)+    # (1) Any text other than open/close DIV
        |   <div[^>]*> (?<tag>)   # Add 1 "tag" value to stack if opening DIV found 
        |   </div> (?<-tag>)      # Remove 1 "tag" value from stack when closing DIV tag is found
    )*
    (?(tag)(?!))                  # Check if "tag" stack is not empty (then fail)
</div>

See the regex demo
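
To tie this back to the C# code in the question, here is a minimal sketch of how the pattern above could be used with Regex.Matches. The class name DivMatcher, the method name FindTopLevelDivs, and the sample input are made up for illustration; the inline (?sx) flags already enable Singleline and IgnorePatternWhitespace, so no RegexOptions are passed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class DivMatcher
{
    // Same pattern as above. The verbatim string keeps the line breaks,
    // which the inline x flag tells the engine to ignore.
    public const string DivPattern = @"(?sx)<div[^>]*>
        (?>
            (?:(?!</?div[^>]*>).)+      # text that is not a DIV tag
            |   <div[^>]*> (?<tag>)     # push on nested opening DIV
            |   </div> (?<-tag>)        # pop on nested closing DIV
        )*
        (?(tag)(?!))                    # fail if the stack is not empty
    </div>";

    // Returns the outermost <div>...</div> blocks found in the input.
    public static List<string> FindTopLevelDivs(string html)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(html, DivPattern))
            result.Add(m.Value);
        return result;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample input with one nested and one flat DIV.
        string html = "<div id=\"a\">one<div>inner</div></div><p>x</p><div>two</div>";
        foreach (string div in DivMatcher.FindTopLevelDivs(html))
            Console.WriteLine(div);
    }
}
```

With this sample input, the two outermost blocks should come back as separate matches, and the nested DIV stays inside the first one. Matches do not overlap, so inner DIVs are never reported on their own.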

However, you might really want to use HtmlAgilityPack to parse HTML.

The main point is to get an XPath that will match all DIV tags that have no ancestors with the same name. You might want something like this (untested):

// Requires: using System; using System.Collections.Generic; using System.Linq;
private List<string> GetTopmostDivs(string html)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a string of markup: parse it directly
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select every div that has no div ancestor, i.e. the topmost ones
    var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
    // SelectNodes returns null when nothing matches
    if (nodes != null)
        return nodes.Select(p => p.OuterHtml).ToList();
    else
        return new List<string>();
}
Wiktor Stribiżew answered Oct 27 '22 00:10