PHP: Extract text between specific tags from a webpage [duplicate]

Possible Duplicate:
Best methods to parse HTML with PHP

I understand I should be using an HTML parser such as PHP's DOMDocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or TagSoup.

How would I use DOMDocument to extract the text between specific tags, for example h1, h2, h3, p, and table? It seems getElementsByTagName only lets me do this for one tag at a time.

Is there a better HTML parser for this task? Or how would I loop through a DOMDocument?

asked Jan 01 '26 00:01 by giorgio79



1 Answer

You are correct: use DOMDocument, since regular expressions are not a good tool for parsing HTML.

getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:

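// Returns the text of every matching element, grouped by tag name.
function extractTexts($html)
{
  $document = new \DOMDocument();
  $document->loadHTML($html);

  $tags = array('h1', 'h2', 'h3', 'h4', 'p');
  $texts = array();
  foreach ($tags as $tag)
  {
    // getElementsByTagName accepts one tag name, so query once per tag.
    $elementList = $document->getElementsByTagName($tag);
    foreach ($elementList as $element)
    {
      $texts[$element->tagName][] = $element->textContent;
    }
  }
  return $texts;
}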

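Note that you should probably add some error handling (loadHTML emits warnings on malformed markup). Because the results are grouped by tag name, you also lose the context of the texts, i.e. the order in which they appeared in the document. Adapt the code as you see fit.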
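
If the order of the texts matters, one possible variation is to query all tags at once with DOMXPath, which in practice returns the matches in document order. The sketch below is not part of the original answer: the function name extractTextsInOrder and the sample markup are made up for illustration, it assumes PHP 7.4 or later, and it assumes $tags contains only plain element names. It also shows one way to keep libxml's parse warnings from being emitted.

```php
<?php
/**
 * Hypothetical helper: returns the text of the requested tags as a flat list,
 * in the order the elements appear in the document.
 */
function extractTextsInOrder(string $html, array $tags): array
{
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        throw new RuntimeException('Could not parse the HTML input.');
    }

    // Build a union expression such as "//h1 | //h2 | //p".
    $expression = implode(' | ', array_map(
        static fn (string $tag): string => '//' . $tag,
        $tags
    ));

    $xpath = new DOMXPath($document);
    $texts = [];
    foreach ($xpath->query($expression) as $node) {
        $texts[] = [
            'tag'  => $node->nodeName,
            'text' => trim($node->textContent),
        ];
    }

    return $texts;
}

// Sample input, invented for this example.
$html = '<html><body><h1>Title</h1><p>Intro</p><h2>Part</h2><p>Body</p></body></html>';

foreach (extractTextsInOrder($html, ['h1', 'h2', 'p']) as $item) {
    echo $item['tag'], ': ', $item['text'], PHP_EOL;
}
```

Each entry keeps its tag name next to its text, so the grouped-by-tag structure can still be rebuilt from this list if needed.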

answered Jan 02 '26 14:01 by PatrikAkerstrand
