I am loading a bunch of RSS feeds using DOM, and sometimes one will return a 404 instead of the feed. The problem is that the web server sends an HTML 404 page in place of the expected XML file, so with this code:
$rssDom = new DOMDocument();
$rssDom->load($url);
$channel = $rssDom->getElementsByTagName('channel');
$channel = $channel->item(0);
$items = $channel->getElementsByTagName('item');
I get this warning:
Warning: DOMDocument::load() [domdocument.load]: Entity 'nbsp' not defined
Followed by this error:
Fatal error: Call to a member function getElementsByTagName() on a non-object
Normally, this code works fine, but on the occasion that I get a 404 it fails to do anything. I tried a standard try-catch around the load statement but it doesn't seem to catch it.
You can suppress the output of parsing errors with
libxml_use_internal_errors(true);
To check whether the returned response is a 404, you can check $http_response_header after the call to DOMDocument::load().
Example:
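libxml_use_internal_errors(true);
$rssDom = new DOMDocument();
$rssDom->load($url);
// $http_response_header[0] holds the status line, e.g. "HTTP/1.1 404 Not Found".
// Compare strpos() against false explicitly and guard against the variable being unset.
if (isset($http_response_header[0]) && strpos($http_response_header[0], ' 404') !== false) {
    die('file not found. exiting.');
}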
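Suppressing the warnings does not by itself prevent the fatal error from the question, which comes from calling getElementsByTagName() on the null returned by item(0). A minimal sketch of one way to guard against that is shown below; the helper name feedItems and its return shape are hypothetical, chosen only for illustration. It collects the parser errors with libxml_get_errors() and checks both the result of loadXML() and the channel lookup.

```php
<?php
// Hypothetical helper (not from the original answer): parses feed markup and
// returns ['items' => DOMElement[], 'errors' => string[]].
function feedItems(string $markup): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // loadXML() rejects an empty string, so check for that first.
    $loaded = $markup !== '' && $dom->loadXML($markup);

    // Turn the collected LibXMLError objects into readable strings.
    $errors = [];
    foreach (libxml_get_errors() as $error) {
        $errors[] = trim($error->message) . ' (line ' . $error->line . ')';
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return ['items' => [], 'errors' => $errors];
    }

    // item(0) returns null when there is no <channel>, e.g. for an error page
    // that happens to be well-formed XML.
    $channel = $dom->getElementsByTagName('channel')->item(0);
    if ($channel === null) {
        return ['items' => [], 'errors' => $errors];
    }

    return [
        'items'  => iterator_to_array($channel->getElementsByTagName('item')),
        'errors' => $errors,
    ];
}
```

With this shape, a caller can skip a feed whose items list is empty and log the errors list, rather than letting one bad response stop the whole run.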
The alternative would be to use file_get_contents, check the response header, and, if it's not a 404, load the markup with DOMDocument::loadXML. This would prevent DOMDocument from parsing invalid XML.
Note that all this assumes that the server correctly returns a 404 header in the response.
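As a rough sketch of the file_get_contents alternative, the code below fetches the body first and only hands it to DOMDocument::loadXML when the status code is 200. The helper names lastStatusCode and fetchFeed, the 10-second timeout, and the choice to accept only 200 are assumptions made for this example. It reads the last status line rather than the first because a redirected request leaves several status lines in $http_response_header.

```php
<?php
// Hypothetical helper: returns the status code of the LAST status line in a
// header list, or null if there is none.
function lastStatusCode(array $headers): ?int
{
    $code = null;
    foreach ($headers as $header) {
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $header, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code;
}

// Hypothetical helper: returns the parsed feed, or null if the request or the
// parse failed.
function fetchFeed(string $url): ?DOMDocument
{
    // ignore_errors makes file_get_contents return the body of an error
    // response instead of false plus a warning.
    $context = stream_context_create([
        'http' => ['ignore_errors' => true, 'timeout' => 10],
    ]);
    $body = @file_get_contents($url, false, $context);

    // The HTTP stream wrapper sets $http_response_header in the local scope.
    $status = lastStatusCode($http_response_header ?? []);
    if ($body === false || $body === '' || $status !== 200) {
        return null;
    }

    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    return $dom->loadXML($body) ? $dom : null;
}
```

A caller would then write something like `$rssDom = fetchFeed($url);` and skip the feed when the result is null.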
Load the feed manually with file_get_contents or curl (which allows you to do your own error checks), and if all goes well, feed the result to DOMDocument::loadXML (the feed is XML, so loadXML fits better than loadHTML).
There are plenty of curl examples around; to get the HTTP status code you would use curl_getinfo with CURLINFO_HTTP_CODE.
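A minimal sketch of the curl route might look like the following. It assumes the curl extension is installed; the helper names curlFetch and loadFeedWithCurl, the timeouts, and the decision to follow redirects are illustrative choices rather than part of the original answer.

```php
<?php
// Hypothetical helper: fetches $url and returns [status code, body].
function curlFetch(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // report the status of the final response
        CURLOPT_CONNECTTIMEOUT => 5,
        CURLOPT_TIMEOUT        => 10,
    ]);
    $body = curl_exec($ch);
    // The code is 0 when no HTTP response was received at all.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    return [$status, $body === false ? '' : $body];
}

// Hypothetical helper: parses the feed only when the server answered 200.
function loadFeedWithCurl(string $url): ?DOMDocument
{
    [$status, $body] = curlFetch($url);
    if ($status !== 200 || $body === '') {
        return null;
    }
    libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    return $dom->loadXML($body) ? $dom : null;
}
```

Because the status check happens before parsing, an HTML 404 page never reaches DOMDocument.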