DOMDocument::loadHTML(): input conversion failed due to input error

I am trying to scrape a Chinese website using PHP and cURL. Earlier I had an issue with compressed results, and SO helped me sort it out. Now I'm having trouble parsing the contents with PHP's DOMDocument. The error is as follows:

Warning: DOMDocument::loadHTML(): input conversion failed due to input error, bytes 0xE3 0x80 0x90 0xE8 in /var/www/html/ ..

Even though it is only a warning, it prevents me from getting any further results.

My code is as given below:

$agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0';
$curl = curl_init(); 
curl_setopt($curl, CURLOPT_URL,$url); 
curl_setopt($curl, CURLOPT_HTTPHEADER, array('text/html; charset=gb2312')); 
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);  
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
curl_setopt($curl, CURLOPT_ENCODING, "");  // handling all compressions 
curl_setopt($curl, CURLOPT_USERAGENT, $agent);
curl_setopt($curl, CURLOPT_TIMEOUT, 1000);
$html = curl_exec($curl) or die("error: ".curl_error($curl));
curl_close($curl);
$htmlParsed = mb_convert_encoding($html, 'utf-8', 'gb2312');

$doc = new DOMDocument();
$doc->loadHTML($htmlParsed);

$xpath = new DOMXpath($doc);

$elements = $xpath->query('//div[@class="test"]//a/@href');

if (!is_null($elements)) {
  foreach ($elements as $element) {
    echo "<br/>[". $element->nodeName. "]";

    $nodes = $element->childNodes;
    foreach ($nodes as $node) {
      echo $node->nodeValue. "\n";
    }
  }
}

The target website declares its content type as:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

So I tried converting the result to UTF-8.

Since the input conversion fails at the DOMDocument::loadHTML() line, I can't parse the web page to get the results. I am currently stuck at this point, and any help or suggestions would be highly appreciated. Thanks in advance.

(Earlier I used Simple HTML DOM Parser, which was pretty simple. But after reading about its drawbacks on SO, I decided to switch to PHP's native DOM parser.)

Surabhil asked Mar 20 '23 18:03


2 Answers

I found a solution today:

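$html = new DOMDocument();
$html_source = get_html(); // get_html() stands in for however you fetch the page
// Encode every non-ASCII character as an HTML entity so the declared charset no longer matters.
// Note: the "HTML-ENTITIES" target is deprecated as of PHP 8.2.
$html_source = mb_convert_encoding($html_source, "HTML-ENTITIES", "UTF-8");
$html->loadHTML($html_source);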
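
On PHP 8.2 and later, the same idea can be sketched with mb_encode_numericentity() instead of the "HTML-ENTITIES" target. This is only a minimal sketch, not part of the original answer: toNumericEntities() is a hypothetical helper name, the sample page is a stand-in for the cURL response, and it assumes the fetched bytes really are GB2312. Once every non-ASCII character is a numeric reference, the markup is plain ASCII, so the charset DOMDocument picks should no longer matter.

```php
<?php
// Hypothetical helper (not from the original answer): convert the page to
// UTF-8, then replace every non-ASCII character with a numeric reference.
function toNumericEntities(string $html, string $sourceEncoding): string
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);
    // Conversion map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($utf8, $map, 'UTF-8');
}

// Simulated GB2312 page standing in for the cURL response.
$page = mb_convert_encoding(
    '<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    . '</head><body><div class="test"><a href="/news">【新闻】</a></div></body></html>',
    'GB2312',
    'UTF-8'
);

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML(toNumericEntities($page, 'GB2312'));
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$link = $xpath->query('//div[@class="test"]//a')->item(0);
echo $link->getAttribute('href'), ' ', $link->textContent, "\n";
```

The text returned by textContent is UTF-8, regardless of the encoding the page was served in.
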
jianyong answered Apr 06 '23 00:04


Without seeing the full head of the document that you are parsing I can only guess, but if the <meta> tag with the character encoding data does not come directly after the <head> tag, you may be running into a situation where DOMDocument falls back to its default of ISO-8859-1 and then hits the 【 character (the first three "invalid" bytes in the warning, 0xE3 0x80 0x90, are its UTF-8 encoding). The 0x80 byte would be the first bit of nonsense, since this is an unused code point in ISO-8859-1. This would likely trigger the bug in DOMDocument discussed in the comments above, and it could easily happen if an element containing such text (the <title>, for instance) comes before the content-type meta information.
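
As a quick check of the bytes quoted in the warning, here is a minimal sketch that needs only the mbstring extension. The variable names are illustrative; the byte values are copied from the warning message.

```php
<?php
// The first three bytes quoted in the warning.
$bytes = "\xE3\x80\x90";

// They form one well-formed UTF-8 sequence...
$isUtf8 = mb_check_encoding($bytes, 'UTF-8');
// ...whose code point identifies the character.
$codePoint = mb_ord($bytes, 'UTF-8');

// Read as ISO-8859-1 instead, every byte is treated as its own character.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

printf("valid UTF-8: %s\n", $isUtf8 ? 'yes' : 'no');
printf("code point: U+%04X\n", $codePoint);
printf("characters when read as ISO-8859-1: %d\n", mb_strlen($asLatin1, 'UTF-8'));
```

If the code point comes back as U+3010 (the 【 bracket), the string handed to loadHTML() was already UTF-8, while the parser was decoding it as something else.
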

The only thing I can think of to try is to run the HTML through a bit of preparation and move that content-type meta tag to right after the <head> tag, so that the parser uses the correct character set. If you use mb_convert_encoding or iconv to convert the encoding to ISO-8859-1 or UTF-8, make sure that you also modify the meta information, because DOMDocument is, unfortunately, brittle in many ways.
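
A rough sketch of that preparation step might look like the following. normalizeCharsetMeta() is a hypothetical helper name, the regular expressions are deliberately simple and will not cover every way a page can declare its charset, and the sample page (with a title placed before the meta tag) is invented for illustration.

```php
<?php
// Hypothetical helper: convert to UTF-8, drop the old charset declaration,
// and put a UTF-8 declaration directly after the opening <head> tag.
function normalizeCharsetMeta(string $html, string $sourceEncoding): string
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceEncoding);

    // Remove any existing <meta ... charset=...> declaration.
    $utf8 = preg_replace('/<meta\b[^>]*charset[^>]*>/i', '', $utf8);

    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';

    // Insert the new declaration right after the first <head> tag.
    $withMeta = preg_replace('/<head\b[^>]*>/i', '$0' . $meta, $utf8, 1, $count);

    // Assumption: if the page has no <head> tag, prepending the tag is enough.
    return $count > 0 ? $withMeta : $meta . $utf8;
}

// Simulated GB2312 page whose <title> comes before the charset declaration.
$page = mb_convert_encoding(
    '<html><head><title>【新闻】</title>'
    . '<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />'
    . '</head><body><p>ok</p></body></html>',
    'GB2312',
    'UTF-8'
);

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML(normalizeCharsetMeta($page, 'GB2312'));
libxml_clear_errors();

$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo $title, "\n";
```

The point of the design is that the bytes and the declaration agree: the content is UTF-8, and the first thing the parser sees inside the head says so.
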

Reid Johnson answered Apr 06 '23 01:04