Parse HTML with PHP's HTML DOMDocument

I tried to do this with "getElementsByTagName", but it wasn't working. I'm new to using DOMDocument to parse HTML: I used regex until yesterday, when some kind folks here told me that DOMDocument would be better for the job, so I'm giving it a try :)

I googled around for a while looking for an explanation, but didn't find anything that helped (not with the class part, anyway).

So I want to capture "Capture this text 1" and "Capture this text 2" and so on.

Doesn't look too hard, but I can't figure it out :(

<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
Mint asked Apr 03 '10 12:04


2 Answers

If you want to get :

  • The text
  • that's inside a <div> tag with class="text"
  • that's, itself, inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).
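To make that concrete, here is a minimal sketch of what getElementsByTagName gives you on the markup from the question, and the manual filtering it would take to keep only the wanted elements. The parent check and the $texts array are illustrative choices, not something from the question.

```php
<?php
// Sketch: getElementsByTagName() matches on tag name only, so the class
// has to be checked by hand. The markup is the sample from the question.
$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>
<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$divs = $dom->getElementsByTagName('div');
echo $divs->length, "\n"; // 4: both "main" divs and both "text" divs

$texts = [];
foreach ($divs as $div) {
    $parent = $div->parentNode;
    // Keep only <div class="text"> whose direct parent is <div class="main">.
    if ($div->getAttribute('class') === 'text'
        && $parent instanceof DOMElement
        && $parent->getAttribute('class') === 'main') {
        $texts[] = trim($div->nodeValue);
    }
}
print_r($texts);
```

This works, but the filtering logic grows quickly as the conditions get more specific, which is the main reason to prefer a query language for the job.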

Instead, I would use an XPath query on your document, using the DOMXPath class.


For example, something like this should do, to load the HTML string into a DOM object and instantiate the DOMXPath class :

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
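The sample above is clean markup, but HTML fetched from real pages often is not, and loadHTML then emits PHP warnings. A minimal sketch of one way to handle that, assuming you would rather collect the parse errors than see the warnings; the $messy string is a made-up example with a stray closing tag:

```php
<?php
// Sketch, assuming imperfect real-world HTML. $messy is a hypothetical
// input with one </div> too many.
$messy = '<div class="main"><div class="text">Capture this text 1</div></div></div>';

// Collect libxml parse errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($messy);

foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError; message and line describe the problem.
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

// Restore whatever error handling mode was active before.
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
```

The document is still loaded and can be queried as usual; which errors get reported depends on the libxml version PHP was built against.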


You can then run XPath queries with the DOMXPath::query method, which returns the list of elements you were searching for :

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}


And executing this gives me the following output :

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
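One caveat: @class="text" compares the whole attribute value, so it would miss an element written as class="text small". If your real markup carries several classes per element, a common XPath 1.0 idiom is to match the class as a whitespace-separated token. A sketch under that assumption; the extra classes and the $hasClass helper are made up for illustration:

```php
<?php
// Sketch: match "text" as one class among several. The extra classes
// ("wide", "small", "context") are hypothetical, not from the question.
$html = <<<HTML
<div class="main wide">
    <div class="text small">
    Capture this text 1
    </div>
</div>
<div class="main">
    <div class="context">
    Not wanted
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Pad the class list with spaces so " text " only matches a whole token
// (and not, say, "context").
$hasClass = function (string $name): string {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
};

$query = '//div[' . $hasClass('main') . ']/div[' . $hasClass('text') . ']';

$texts = [];
foreach ($xpath->query($query) as $tag) {
    $texts[] = trim($tag->nodeValue);
}
print_r($texts);
```

For markup like the question's, where each element has exactly one class, the simpler @class="text" comparison is enough.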
Pascal MARTIN answered Oct 13 '22 10:10


You can use http://simplehtmldom.sourceforge.net/

It is a simple, easy-to-use DOM parser written in PHP, with which you can easily fetch the content of a div tag.

Something like this:

// Find all <div> elements whose class attribute is "text"
$ret = $html->find('div[class=text]');

See the documentation of it for more help.

lokeshsk answered Oct 13 '22 08:10