Before anybody asks, I am not doing any kind of screenscraping.
I'm trying to parse an html string to find a div with a certain id. I cannot for the life of me get this to work. The following expression worked in one instance, but not in another. I'm not sure if it has to do with extra elements in the html or not.
<div\s*?id=(\""|&quot;|&#34;)content(\""|&quot;|&#34;).*?>\s*?(?>(?! <div\s*?> | </div> ) | <div\s*?>(?<DEPTH>) | </div>(?<-DEPTH>) | .?)*(?(DEPTH)(?!))</div>
It finds the first div with the right id correctly, but then it closes at the first closing div rather than at the closing div that matches it.
<div id="firstdiv">begining content<div id="content">some other stuff
<div id="otherdiv">other stuff here</div>
more stuff
</div>
</div>
This should bring back
<div id="content">some other stuff
<div id="otherdiv">other stuff here</div>
more stuff
</div>
but for some reason it does not. It is bringing back:
<div id="content">some other stuff
<div id="otherdiv">other stuff here</div>
Does anybody have an easier expression to handle this?
To clarify, this is in .NET, and I'm using the DEPTH keyword. You can find more details here.
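In .NET you can do this with balancing groups. Each nested opening div pushes an empty capture onto the depth group, each closing div pops one, and the final closing div can only match once the stack is empty:
<div\s+id=(?:"|&quot;|&\#34;)content(?:"|&quot;|&\#34;)[^>]*>
(?>
<div\b[^>]*> (?<depth>)
|
</div> (?<-depth>)
|
(?!</?div\b) .
)*
(?(depth)(?!))
</div>
The pattern is spread over several lines for readability, so it needs the IgnorePatternWhitespace option if you use it exactly as shown. In either form you must use the Singleline option so that . also matches newlines. Here is an example using the console:
using System;
using System.Text.RegularExpressions;

namespace Temp
{
    class Program
    {
        static void Main()
        {
            string s = @"
<div id=""firstdiv"">begining content<div id=""content"">some other stuff
<div id=""otherdiv"">other stuff here</div>
more stuff
</div>
</div>";

            // Same pattern as above, written without the layout whitespace.
            Regex r = new Regex(
                @"<div\s+id=(?:""|&quot;|&\#34;)content(?:""|&quot;|&\#34;)[^>]*>"
                + @"(?>"
                + @"<div\b[^>]*>(?<depth>)"   // nested opening div: push
                + @"|</div>(?<-depth>)"       // nested closing div: pop (fails when the stack is empty)
                + @"|(?!</?div\b)."           // any character that does not start a div tag
                + @")*"
                + @"(?(depth)(?!))"           // fail if a nested div is still open
                + @"</div>",                  // the closing tag of the content div
                RegexOptions.Singleline);

            Console.WriteLine("HTML:\n");
            Console.WriteLine(s);

            Match m = r.Match(s);
            if (m.Success)
            {
                Console.WriteLine("\nCaptured text:\n");
                Console.WriteLine(m.Value);
            }
            Console.ReadLine();
        }
    }
}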
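Are you asking for a regular expression that can keep track of the number of DIV tags nested inside a DIV tag? I'm afraid that isn't possible with regular expressions in the strict sense; counting nesting depth takes an engine-specific extension such as .NET balancing groups.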
You could use a regular expression to get the index of the first DIV tag, then loop over the characters in the string, starting at that index, keeping a count of the number of open div tags: add one for each opening div tag and subtract one for each closing div tag. When you encounter a closing div tag and the count is zero, you have the starting and ending indices of the substring you want.
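A minimal sketch of that counting approach, in case it helps. The class and method names (DivExtractor, ExtractDiv) are made up for illustration, and instead of walking the string character by character it steps from one div tag to the next with a small regex, which amounts to the same bookkeeping. It assumes the id attribute is quoted and that div tags are not hidden inside comments or scripts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class DivExtractor
{
    // Matches any opening <div ...> tag or closing </div> tag.
    private static readonly Regex DivTag =
        new Regex(@"<div\b[^>]*>|</div\s*>", RegexOptions.IgnoreCase);

    // Returns the div with the given id, including its tags,
    // or null if it is missing or never closed.
    public static string ExtractDiv(string html, string id)
    {
        Regex open = new Regex(
            @"<div\b[^>]*?\sid\s*=\s*[""']" + Regex.Escape(id) + @"[""'][^>]*>",
            RegexOptions.IgnoreCase);
        Match start = open.Match(html);
        if (!start.Success)
        {
            return null;
        }

        int depth = 0; // nested divs currently open inside the target div
        for (Match tag = DivTag.Match(html, start.Index + start.Length);
             tag.Success;
             tag = tag.NextMatch())
        {
            if (tag.Value[1] != '/')
            {
                depth++;            // nested opening div
            }
            else if (depth > 0)
            {
                depth--;            // closes a nested div
            }
            else
            {
                // Closing tag seen with nothing nested open: this closes the target.
                int end = tag.Index + tag.Length;
                return html.Substring(start.Index, end - start.Index);
            }
        }
        return null; // ran out of tags before the target div was closed
    }
}
```

Calling DivExtractor.ExtractDiv(html, "content") on the sample from the question should give back the content div together with the nested otherdiv, and stop before the closing tag of firstdiv.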
Cybis speaks the truth. Nested tags like this are a context-free language, and context-free languages are strictly more powerful than regular languages (the kind of thing covered by regular expressions). There's a lot of computer science theory involved, but suffice it to say that any language worth its salt will have an HTML parsing library for this sort of thing, and you should probably be using it.