Assume I have an HTML page as follows:
<!-- This is the opening tag -->
<div class="content_text">
    <div>Title</div>
    <div>Author Name</div>
    <div>Some complicated HTML elements correctly validated</div>
    <b>Some more text</b>
    <img ... />
    <div> more and more text </div>
</div><!-- This is the correct closing tag -->
How do I get the content between the opening of the div with class="content_text"
and its correct closing tag?
I tried regular expressions, but I couldn't find any easy or even hard way to do it.
I tried XPath, but I still couldn't get the content. Instead I got the text inside the outer div.
You can use the PHP Simple HTML DOM Parser to parse HTML much as DOMDocument
parses XML.
Note: PHP's built-in DOMDocument can also parse HTML directly, via its loadHTML() method.
$scrape_address = "http://www.al-madina.com/node/444862";
$ch = curl_init($scrape_address);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
curl_setopt($ch, CURLOPT_HEADER, false);        // keep response headers out of the body
curl_setopt($ch, CURLOPT_ENCODING, "");         // accept any encoding cURL supports
$data = curl_exec($ch);
if ($data === false) {
    exit("Request failed: " . curl_error($ch));
}
// I couldn't get an element by attribute, so I just replaced the class with an id
$data = str_replace('class="content_text"', 'id="my_unique_id"', $data);
$domd = new DOMDocument();
libxml_use_internal_errors(true);  // suppress warnings caused by imperfect real-world HTML
$domd->loadHTML($data);
libxml_use_internal_errors(false);
$div = $domd->getElementById("my_unique_id");
if ($div) {
    // Copy the div and all of its descendants into a fresh document and print it
    $dom2 = new DOMDocument();
    $dom2->appendChild($dom2->importNode($div, true));
    echo $dom2->saveHTML();
} else {
    echo "Nothing found";
}
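If you would rather not rewrite the class into an id, DOMXPath can select the element by its class attribute directly. The following is only a minimal sketch, not part of the answer above: the helper name innerHtmlByClass is made up for illustration, and it assumes the class name contains no double quotes. Because the parser builds a tree, nested divs are handled for you and the match always ends at the correct closing tag. It returns the markup between the opening and closing tags rather than the element itself.

```php
<?php
// Hypothetical helper (not from the original answer): returns the inner HTML
// of the first element whose class list contains $class, or null if none match.
// Assumption: $class contains no double quotes, since it is placed in an XPath literal.
function innerHtmlByClass(string $html, string $class): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    $xpath = new DOMXPath($dom);
    // Match the class as a whole word, so "content_text_extra" does not match
    $query = sprintf(
        '//*[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );
    $node = $xpath->query($query)->item(0);
    if ($node === null) {
        return null;
    }

    // Serialize each child node; concatenated, they form the inner HTML
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Usage with a cut-down version of the markup from the question
$html = '<div class="content_text"><div>Title</div><b>Some more text</b></div>';
echo innerHtmlByClass($html, 'content_text');
```

To use it on a live page, pass in the $data string returned by curl_exec instead of the literal above. If the page is not ASCII, loadHTML may need an encoding hint to keep the characters intact.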