I wrote this pattern to match nested divs:
(<div[^>]*>(?:\g<1>|.)*?<\/div>)
This works nicely, as you can see in regex101.
However, when I write the code below in C#:
Regex findDivs = new Regex("(<div[^>]*>(?:\\g<1>|.)*?<\\/div>)", RegexOptions.Singleline);
It throws an error:
Additional information: parsing "(<div[^>]*>(?:\g<1>|.)*?<\/div>)" - Unrecognized escape sequence \g.
As you can see, \g doesn't work in C#. How can I match the first subpattern then?
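What you are looking for is balancing groups. The .NET regex engine does not support recursion constructs such as \g<1>, but balancing groups let you track nesting depth with a capture stack. Here is an equivalent of your regex for .NET: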
(?sx)<div[^>]*>                # Opening DIV
(?>                            # Start of atomic group
    (?:(?!</?div[^>]*>).)+     # Any text other than open/close DIV
  | <div[^>]*> (?<tag>)        # Add 1 "tag" value to stack if opening DIV found
  | </div> (?<-tag>)           # Remove 1 "tag" value from stack when closing DIV tag is found
)*
(?(tag)(?!))                   # Check if "tag" stack is not empty (then fail)
</div>
See the regex demo
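As a minimal sketch of how the pattern above might be used from C#: the class and method names (`DivMatcher`, `FindTopLevelDivs`) are hypothetical and only serve the illustration. The pattern is kept in a verbatim string, so the inline comments are reworded to avoid double quotes, and the `(?sx)` inline flags are left as in the pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class DivMatcher
{
    // Same balancing-group pattern as above, with shortened comments.
    private static readonly Regex OutermostDivs = new Regex(@"(?sx)
        <div[^>]*>                      # opening DIV
        (?>                             # atomic group
            (?:(?!</?div[^>]*>).)+      # any text that is not a DIV tag
          | <div[^>]*> (?<tag>)         # nested opening DIV: push onto the tag stack
          | </div> (?<-tag>)            # nested closing DIV: pop from the tag stack
        )*
        (?(tag)(?!))                    # fail if the tag stack is not empty
        </div>");

    // Returns the full text of every outermost <div>...</div> block.
    public static List<string> FindTopLevelDivs(string html)
    {
        return OutermostDivs.Matches(html)
                            .Cast<Match>()
                            .Select(m => m.Value)
                            .ToList();
    }
}
```

Calling `DivMatcher.FindTopLevelDivs("<div>a<div>b</div>c</div><div>d</div>")` should give two entries, one per outermost DIV, with the nested DIV kept inside the first entry.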
However, you might really want to use HtmlAgilityPack to parse HTML.
The main point is to get an XPath that will match all DIV tags that have no ancestors with the same name. You might want something like this (untested):
// Requires: using System; using System.Collections.Generic; using System.Linq;
private List<string> GetTopmostDivs(string html)
{
    HtmlAgilityPack.HtmlDocument hap;
    Uri uriResult;
    if (Uri.TryCreate(html, UriKind.Absolute, out uriResult)
        && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps))
    { // html is a URL: download and parse the page
        var web = new HtmlAgilityPack.HtmlWeb();
        hap = web.Load(uriResult.AbsoluteUri);
    }
    else
    { // html is a markup string
        hap = new HtmlAgilityPack.HtmlDocument();
        hap.LoadHtml(html);
    }
    // Select every DIV that has no DIV ancestor, i.e. the outermost ones
    var nodes = hap.DocumentNode.SelectNodes("//div[not(ancestor::div)]");
    // SelectNodes returns null when nothing matches
    if (nodes == null)
        return new List<string>();
    return nodes.Select(p => p.OuterHtml).ToList();
}