Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Bulletproofing SimpleXMLElement

Everyone knows that we should always use DOM techniques instead of regexes to extract content from HTML, but I get the feeling that I can never trust the SimpleXML extension or similar ones.

I'm coding a OpenID implementation right now, and I tried using SimpleXML to do the HTML discovery - but my very first test (with alixaxel.myopenid.com) yielded a lot of errors:

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 27: parser error : Opening and ending tag mismatch: link line 11 and head in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </head> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 64: parser error : Entity 'copy' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: &copy; 2008 <a href="http://janrain.com/">JanRain, Inc.</a> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 66: parser error : Entity 'trade' not defined in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: myOpenID&trade; and the myOpenID&trade; website are in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 77: parser error : Opening and ending tag mismatch: link line 8 and html in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: </html> in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag head line 3 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: Entity: line 78: parser error : Premature end of data in tag html line 2 in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: in E:\xampplite\htdocs\index.php on line 6

Warning: simplexml_load_string() [function.simplexml-load-string]: ^ in E:\xampplite\htdocs\index.php on line 6

I recall there was a way to make SimpleXML always parse a file, independently if the document contains errors or not - I can't remember the specific implementation though, but I think it involved using DOMDocument. What is the best way to make sure SimpleXML always parses any given document?

And please don't suggest using Tidy, I think the extension is slow and it's not available on many systems.

like image 647
Alix Axel Avatar asked Jul 22 '10 18:07

Alix Axel


1 Answers

You can load the HTML with DOM's loadHTML then import the result to SimpleXML.

IIRC, it will still choke on some stuff but it will accept pretty much anything that exists in the real world of broken websites.

$html = '<html><head><body><div>stuff & stuff</body></html>';

// disable PHP errors
$old = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

// restore the old behaviour
libxml_use_internal_errors($old);

$sxe = simplexml_import_dom($dom);
die($sxe->asXML());
like image 153
Josh Davis Avatar answered Nov 19 '22 03:11

Josh Davis