php: Extract text between specific tags from a webpage [duplicate]

Question

Possible Duplicate:
Best methods to parse HTML with PHP

I understand I should be using an HTML parser such as PHP's DOMDocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or TagSoup.

How would I use DOMDocument to extract the text between specific tags, for example h1, h2, h3, p, and table? It seems I can only do this for one tag at a time with getElementsByTagName.

Is there a better HTML parser for this task? Or how would I loop through the DOMDocument?

PatrikAkerstrand · Accepted Answer

You are correct: use DOMDocument. Regular expressions are NOT a good tool for parsing HTML, because HTML is not a regular language and real-world markup is rarely well-formed.

getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:

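// extractTexts is an illustrative name; the return statement needs a function around it.
function extractTexts($html)
{
  $document = new \DOMDocument();
  $document->loadHTML($html);

  // Tags whose text should be collected.
  $tags = array ('h1', 'h2', 'h3', 'h4', 'p');
  $texts = array ();
  foreach($tags as $tag)
  {
    // getElementsByTagName() returns a DOMNodeList of every matching element.
    $elementList = $document->getElementsByTagName($tag);
    foreach($elementList as $element)
    {
      // Group each element's text under its tag name.
      $texts[$element->tagName][] = $element->textContent;
    }
  }
  return $texts;
}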

Note that you should add some error handling (loadHTML() emits warnings on malformed markup), and that grouping the texts by tag name loses their context, that is, the order in which they appear in the document. Adapt the code as you see fit.
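If the order of the texts matters, one possible variation is to select all the wanted tags with a single XPath expression, which returns nodes in document order. The sketch below is not part of the original answer: the function name extractTextsInOrder and the shape of the returned array are assumptions made for illustration, and it assumes PHP 7.4 or later and a trusted list of lowercase tag names. It also silences libxml warnings, which is only one way to handle malformed markup.

```php
<?php
// Hypothetical helper: collect the text of several tags in document order.
function extractTextsInOrder(string $html, array $tags): array
{
    // Nothing to parse or nothing to look for.
    if ($html === '' || $tags === []) {
        return [];
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($loaded === false) {
        return [];
    }

    // Build one location path such as "//*[self::h1 or self::h2 or self::p]".
    // The tag names are assumed to be trusted, lowercase element names.
    $conditions = array_map(fn ($tag) => 'self::' . $tag, $tags);
    $query = '//*[' . implode(' or ', $conditions) . ']';

    $xpath = new DOMXPath($document);
    $results = [];
    foreach ($xpath->query($query) as $element) {
        // Keep the tag name next to the text so the context is not lost.
        $results[] = [
            'tag'  => $element->nodeName,
            'text' => trim($element->textContent),
        ];
    }
    return $results;
}
```

Calling extractTextsInOrder($html, ['h1', 'h2', 'h3', 'p']) gives a flat list of tag/text pairs in page order, so a heading stays next to the paragraphs that follow it. Note that a p nested inside a table would be returned too, and a table's own textContent includes the text of all its cells.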

Tags:

regex

php

html-parsing

giorgio79

1 Answers

PatrikAkerstrand
