Problem:
Extract all HTML between two headers, including the headers' own HTML. The header text is known, but not the formatting, tag name, etc. The headers are not within the same parent, and the elements involved might (well, almost certainly will) have nested children of their own.
To clarify: headers could be inside a <h1> or <div> or any other tag. They may also be surrounded by <b>, <i>, <font> or more <div> tags. The key is: the only text within the element is the header text.
The tools I have available are C# 3.0 using a WebBrowser control, or jQuery/JS.
I've taken the jQuery route, traversing the DOM, but I've run into the issue of handling children and adding them appropriately. Here is the code so far:
function getAllBetween(firstEl, lastEl) {
    var collection = []; // collection of elements
    var fefound = false;
    $('body').find('*').each(function () {
        var curEl = $(this);
        if (curEl.text() == firstEl)
            fefound = true;
        if (curEl.text() == lastEl)
            return false; // stop iterating
        // need something to add children's children,
        // otherwise we get <table></table><tbody></tbody><tr></tr> etc.
        if (fefound)
            collection.push(curEl);
    });
    var div = document.createElement("DIV");
    for (var i = 0, len = collection.length; i < len; i++) {
        $(div).append(collection[i]);
    }
    return $(div).html();
}
Should I continue down this road, with some sort of recursive function checking and handling children, or would a whole new approach be better suited?
For the sake of testing, here is some sample markup:
<body>
    <div>
        <div>Start</div>
        <table><tbody><tr><td>Oops</td></tr></tbody></table>
    </div>
    <div>
        <div>End</div>
    </div>
</body>
Any suggestions or thoughts are greatly appreciated!
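One way the recursive idea from the question could look, as a minimal sketch rather than a definitive implementation. To keep it runnable outside a browser, it walks a plain-object stand-in for the DOM (the helpers el, textOf, toHtml and containsHeader are hypothetical, made up for this example); against a real document the same logic would presumably use textContent and outerHTML. The rule is: take an element whole once the start header has been seen, but descend into any element that still contains the end header instead of taking it whole.

```javascript
// Plain-object stand-in for the DOM (an assumption, so the sketch runs
// outside a browser): leaves carry `text`, other nodes carry `children`.
const el = (tag, content) =>
  typeof content === 'string'
    ? { tag, text: content, children: [] }
    : { tag, text: '', children: content };

// Rough equivalent of jQuery's .text(): all descendant text, concatenated.
const textOf = (node) => node.text + node.children.map(textOf).join('');

// Serialize a node back to markup, roughly like outerHTML.
const toHtml = (node) =>
  `<${node.tag}>${node.text}${node.children.map(toHtml).join('')}</${node.tag}>`;

// True if the node itself or any descendant has exactly `header` as its text.
const containsHeader = (node, header) =>
  textOf(node).trim() === header ||
  node.children.some((child) => containsHeader(child, header));

function getAllBetween(root, firstText, lastText) {
  const collected = [];
  let started = false;
  let finished = false;

  function visit(node) {
    if (finished) return;
    const text = textOf(node).trim();
    if (!started) {
      if (text === firstText) {
        started = true;
        collected.push(node); // start header, taken whole
      } else {
        node.children.forEach(visit); // keep looking deeper
      }
    } else if (text === lastText) {
      collected.push(node); // end header, taken whole
      finished = true;
    } else if (containsHeader(node, lastText)) {
      node.children.forEach(visit); // end is inside: descend, do not take whole
    } else {
      collected.push(node); // fully inside the range: take once, children included
    }
  }

  visit(root);
  return collected.map(toHtml).join('');
}

// The sample markup from the question, rebuilt with the stand-in helpers.
const body = el('body', [
  el('div', [
    el('div', 'Start'),
    el('table', [el('tbody', [el('tr', [el('td', 'Oops')])])]),
  ]),
  el('div', [el('div', 'End')]),
]);

const between = getAllBetween(body, 'Start', 'End');
console.log(between);
```

Because each element is pushed at most once and its descendants are never visited afterwards, the table comes out as one nested piece instead of separate empty table, tbody and tr elements. Note that the outermost element whose only text is the header wins, so the wrapper div around End is returned along with it.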
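My thought is a regex, something along the lines of
.*<(?<tag>.+)>Start</\1>(?<found_data>.+)<\1>End</\1>.*
which should get you everything between the Start and End div tags. Since the markup spans several lines, the pattern needs single-line mode (RegexOptions.Singleline in C#) so that . also matches line breaks.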
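The regex idea above can be sketched in JavaScript as follows, run against the question's sample markup held as a plain string. This is only a sketch under some assumptions: it needs an engine with named groups and the s (dotAll) flag (ES2018 or later), and it deviates from the pattern above by matching the tag name with [^\s>]+, making the middle part lazy, and dropping the leading and trailing .* so that the whole match includes both headers. extractBetween and escapeRegExp are hypothetical helper names.

```javascript
// Sample markup from the question, as a plain string.
const html = [
  '<body>',
  '<div>',
  '<div>Start</div>',
  '<table><tbody><tr><td>Oops</td></tr></tbody></table>',
  '</div>',
  '<div>',
  '<div>End</div>',
  '</div>',
  '</body>',
].join('\n');

// Hypothetical helper: escape header text so it is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Returns the substring from the start header through the end header,
// or null when the pattern does not match.
function extractBetween(source, startText, endText) {
  const pattern = new RegExp(
    '<(?<tag>[^\\s>]+)>' + escapeRegExp(startText) + '</\\k<tag>>' +
      '(?<found_data>.+?)' +
      '<\\k<tag>>' + escapeRegExp(endText) + '</\\k<tag>>',
    's' // dotAll: let . match line breaks too
  );
  const match = pattern.exec(source);
  return match ? match[0] : null;
}

const extracted = extractBetween(html, 'Start', 'End');
console.log(extracted);
```

The result is a raw substring, not a DOM fragment: for the sample markup it contains the closing </div> of the first parent and the opening <div> of the second, so it is not balanced HTML. The pattern also assumes both headers use the same bare tag with no attributes and no extra wrappers such as <b> or <font>.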