
Catch 404 error on DOMDocument->load()

Tags: dom, php, xml, rss

I am loading a number of RSS feeds using DOMDocument, and occasionally one returns a 404 instead of the feed. The problem is that the web server sends an HTML 404 page in place of the expected XML file, so with this code:

$rssDom = new DOMDocument();
$rssDom->load($url);
$channel = $rssDom->getElementsByTagName('channel');
$channel = $channel->item(0);
$items = $channel->getElementsByTagName('item');

I get this warning:

Warning: DOMDocument::load() [domdocument.load]: Entity 'nbsp' not defined

Followed by this error:

Fatal error: Call to a member function getElementsByTagName() on a non-object

Normally, this code works fine, but on the occasion that I get a 404 it fails to do anything. I tried a standard try-catch around the load statement but it doesn't seem to catch it.

fishpen0 asked Dec 27 '22 01:12


2 Answers

You can suppress the output of parsing errors with

libxml_use_internal_errors(true);

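To check whether the response was a 404, inspect $http_response_header after the call to DOMDocument::load(). PHP's HTTP stream wrapper fills this array with the response header lines, and the first element is the status line.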

Example:

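// Collect parse errors internally instead of emitting warnings.
libxml_use_internal_errors(true);
$rssDom = new DOMDocument();
$rssDom->load($url);
// Compare against false explicitly: strpos() returns an offset, which can be 0.
// isset() guards against the request failing before any header was received.
if (isset($http_response_header[0]) && strpos($http_response_header[0], ' 404') !== false) {
    die('file not found. exiting.');
}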
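
Independently of the status check, the fatal error in the question most likely comes from item(0) returning null when the loaded document has no channel element. A minimal sketch of guarding against that, assuming the markup is already available as a string; the helper name extractFeedItems is hypothetical and not part of the original code:

```php
<?php
// Hypothetical helper: return the <item> elements of the first <channel>,
// or an empty array if the markup is not a usable RSS document.
function extractFeedItems(string $xml): array
{
    // Buffer libxml errors instead of emitting warnings, and remember
    // the previous setting so it can be restored afterwards.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    // loadXML() returns false when the markup is not well-formed XML.
    $loaded = $xml !== '' && $dom->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return [];
    }

    // item(0) returns null when no <channel> exists, for example when the
    // server answered with a well-formed HTML error page.
    $channel = $dom->getElementsByTagName('channel')->item(0);
    if ($channel === null) {
        return [];
    }

    return iterator_to_array($channel->getElementsByTagName('item'));
}
```

With this guard a 404 page yields an empty list instead of a fatal error, so the calling loop can simply skip that feed.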

The alternative would be to fetch the document with file_get_contents, check the response header, and load the markup with DOMDocument::loadXML only if the response is not a 404. This prevents DOMDocument from ever parsing the HTML error page as XML.
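
A rough sketch of that alternative, under a few assumptions: the helper names lastHttpStatus and loadFeed are hypothetical, only a 200 response is treated as usable, and the 10 second timeout is an arbitrary choice. The status is taken from the last status line because the header array holds one status line per response when redirects are followed.

```php
<?php
// Hypothetical helper: return the status code of the last status line found
// in the header lines collected by PHP's HTTP stream wrapper, or null.
function lastHttpStatus(array $headers): ?int
{
    $status = null;
    foreach ($headers as $line) {
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            $status = (int) $m[1];
        }
    }
    return $status;
}

// Hypothetical helper: fetch $url and return a DOMDocument, or null on failure.
function loadFeed(string $url): ?DOMDocument
{
    // ignore_errors makes file_get_contents() return the body for 4xx/5xx
    // responses too, instead of emitting a warning and returning false.
    $context = stream_context_create([
        'http' => ['ignore_errors' => true, 'timeout' => 10],
    ]);
    $body = @file_get_contents($url, false, $context);

    // $http_response_header is populated in this scope by the HTTP wrapper.
    $status = lastHttpStatus($http_response_header ?? []);
    if ($body === false || $body === '' || $status !== 200) {
        return null;
    }

    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $ok = $dom->loadXML($body);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $dom : null;
}
```

A caller would then check the result for null before touching the document, for example: if (($rssDom = loadFeed($url)) === null) { continue; }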

Note that all this assumes that the server correctly returns a 404 header in the response.

Gordon answered Dec 31 '22 14:12


Load the document manually with file_get_contents or curl (which allows you to do your own error checks) and, if all goes well, feed the result to DOMDocument::loadXML (or to DOMDocument::loadHTML if the document is HTML rather than an RSS feed).

There are lots of curl examples here (e.g. look at this one, although it's surely not the best); to get the HTTP status code you would use curl_getinfo.
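
As an illustration of the curl route, here is a minimal sketch that assumes the curl extension is installed. The function names fetchWithStatus and isUsableResponse, the 10 second timeout, and the rule that only a 200 response gets parsed are all assumptions made for this example.

```php
<?php
// Hypothetical helper: fetch $url with curl and return [status code, body].
// The body is null when the transfer itself failed (DNS error, timeout, ...).
function fetchWithStatus(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects to the final response
        CURLOPT_TIMEOUT        => 10,   // assumed timeout in seconds
    ]);
    $body = curl_exec($ch);
    // CURLINFO_HTTP_CODE is the status code of the last response received.
    $status = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return [$status, $body === false ? null : $body];
}

// Hypothetical policy: only a 200 response with a non-empty body is parsed.
function isUsableResponse(int $status, ?string $body): bool
{
    return $status === 200 && $body !== null && $body !== '';
}
```

Usage would look like: [$status, $body] = fetchWithStatus($url); and then calling $rssDom->loadXML($body) only when isUsableResponse($status, $body) returns true.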

Jon answered Dec 31 '22 13:12