
DOMDocument loadHTML doesn't work properly on a server

Tags: dom, php

I first ran the code on MAMP, where it worked fine. But when I ran the same code on another server, I got a lot of warnings like:

Warning: DOMDocument::loadHTML(): Unexpected end tag : head in Entity, line: 3349 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17
Warning: DOMDocument::loadHTML(): htmlParseStartTag: misplaced tag in Entity, line: 3350 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17
Warning: DOMDocument::loadHTML(): Tag header invalid in Entity, line: 3517 in /cgihome/zhang1/html/cgi-bin/getPrice.php on line 17

The code is as follows:

<?php
$amazon = file_get_contents('http://www.amazon.com/blablabla');
$doc = new DOMDocument();
$doc->loadHTML($amazon);
$doc->saveHTML();
$price = $doc->getElementById('actualPriceValue')->textContent;
$ASIN = $doc->getElementById('ASIN')->getAttribute('value');
?>

Does anyone know what's going on? Thanks!

LuZ asked Aug 05 '12 19:08


3 Answers

To suppress the warnings, you can use

libxml_use_internal_errors(true);

This works for me. See the manual for details, or read on:


Background: You are loading invalid HTML. Invalid HTML is quite common. DOMDocument::loadHTML corrects most of the problems but emits a warning for each one by default.

With libxml_use_internal_errors you can control that behavior. Set it before loading the document:

$previously = libxml_use_internal_errors(true);
$doc->loadHTML($amazon);

Then after loading you can deal with the errors (if you want/need to):

/* @var LibXMLError[] $xmlErrors */
$xmlErrors = libxml_get_errors();

And finally clear them (as they will add up) and restore the previous setting if applicable:

unset($xmlErrors);
libxml_clear_errors();
libxml_use_internal_errors($previously);
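
Putting the three steps together, a minimal sketch could look like the following. The helper name `loadHtmlQuietly`, its return shape, and the sample markup are assumptions made for illustration; they are not part of the original question.

```php
<?php
/**
 * Load possibly invalid HTML without emitting PHP warnings.
 * Hypothetical helper: the name and return shape are illustrative only.
 *
 * @return array{0: DOMDocument, 1: LibXMLError[]}
 */
function loadHtmlQuietly(string $html): array
{
    // Collect libxml errors internally instead of raising PHP warnings.
    $previously = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Fetch whatever the parser complained about, then empty the buffer
    // so errors do not accumulate across calls.
    $errors = libxml_get_errors();
    libxml_clear_errors();

    // Restore the setting that was active before this call.
    libxml_use_internal_errors($previously);

    return [$doc, $errors];
}

// Sample input (made up): an HTML5 <header> tag plus a stray </head>.
$html = '<html><body><header>Shop</header></head>'
      . '<span id="actualPriceValue">$9.99</span></body></html>';

[$doc, $errors] = loadHtmlQuietly($html);

foreach ($errors as $error) {
    // Each entry is a LibXMLError with message, line, column and level.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

$node = $doc->getElementById('actualPriceValue');
echo $node !== null ? $node->textContent : 'not found', "\n";
```

Which errors get reported, if any, depends on the libxml2 version installed on the server, which may explain why the same code behaves differently on two machines. Checking the result of `getElementById` for `null` also avoids a fatal error when the element is missing from the page.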

References

  • libxml_use_internal_errors: Disable libxml errors and allow user to fetch error information as needed
  • libxml_clear_errors: Clear libxml error buffer
  • libxml_get_errors: Retrieve array of errors
  • LibXMLError: The LibXMLError class
  • Stack Overflow answer to "DOMDocument PHP Memory Leak" (by Tak; Dec 2011)
hakre answered Oct 06 '22 16:10


This problem is caused by markup that is not well-formed.

DOMDocument::loadHTML() emits a warning for each piece of malformed markup it encounters, so one way to avoid the warnings is to clean up the HTML before loading it.

PHP has an extension that does this job pretty well, called Tidy: php.net/book.tidy

It might be tricky, as you may need to enable it in your php.ini.

Then:

$tidy_config = array(
    'clean'          => true,
    'output-xhtml'   => true,
    'show-body-only' => true,
    'wrap'           => 0,
);

// $html holds the markup to clean, e.g. the fetched page.
$tidy = tidy_parse_string($html, $tidy_config, 'utf8');
$tidy->cleanRepair();
$doc = new DOMDocument();
$doc->loadHTML((string) $tidy);
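
Because Tidy is an optional extension, it can help to check that it is loaded before relying on it. The sketch below shows one possible way to do that; the helper name `cleanHtml`, the sample markup, and the fallback behaviour (returning the input unchanged) are assumptions for illustration, not part of the answer above.

```php
<?php
/**
 * Clean HTML with Tidy when the extension is available.
 * Hypothetical helper: falls back to returning the input unchanged.
 */
function cleanHtml(string $html): string
{
    if (!extension_loaded('tidy')) {
        // Tidy is not enabled in php.ini; leave the markup as it is.
        return $html;
    }

    $config = ['output-xhtml' => true, 'wrap' => 0];
    $tidy = tidy_parse_string($html, $config, 'utf8');
    if ($tidy === false) {
        // Parsing failed; fall back to the original markup.
        return $html;
    }
    $tidy->cleanRepair();

    return (string) $tidy;
}

// Sample input (made up): a paragraph that is never closed.
$clean = cleanHtml('<p id="x">unclosed paragraph');

$doc = new DOMDocument();
$doc->loadHTML($clean);

echo trim($doc->getElementById('x')->textContent), "\n";
```

Returning the input unchanged keeps the calling code working on servers without Tidy, at the cost of the parser warnings reappearing there.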
Pascal answered Oct 06 '22 15:10


You can suppress the warnings like this:

@$doc->loadHTML($amazon);
Aminah Nuraini answered Oct 06 '22 15:10