Possible Duplicate:
Best methods to parse HTML with PHP
I understand I should be using a html parser like php domdocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or tagsoup.
How would I use php domdocument to extract text between specific tags, for example get text between h1,h2,h3,p,table? It seems I can only do this for one tag only with getelementbytagname.
Is there a better html parser for such task? Or how would I loop through the php domdocument?
You are correct, use DomDocument (since regex is NOT a good idea for parsing HTML. Why? See here and here for reasons why).
getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:
$document = new \DOMDocument();
$document->loadHTML($html);
$tags = array ('h1', 'h2', 'h3', 'h4', 'p');
$texts = array ();
foreach($tags as $tag)
{
$elementList = $document->getElementsByTagName($tag);
foreach($elementList as $element)
{
$texts[$element->tagName][] = $element->textContent;
}
}
return $texts;
Note that you should probably have some error handling in there, and you will also lose the context of the texts, but you can probably edit this code as you see fit.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With