
How to force XPath to use UTF8?

I have an XHTML document being passed to a PHP app via Greasemonkey AJAX. The PHP app uses UTF8. If I output the POST content straight back to a textarea in the AJAX receiving div, everything is still properly encoded in UTF8.

When I try to parse it using XPath:

$dom = new DOMDocument();
$dom->loadHTML($raw2);
$xpath = new DOMXPath($dom);
$query = '//td/text()';
$nodes = $xpath->query($query);
foreach($nodes as $node) {
  var_dump($node->wholeText);
}

the dumped strings are not UTF-8. How do I force DOM/XPath to use UTF-8?

Gordon asked Jul 20 '09 16:07



2 Answers

I had the same problem and I couldn't use tidy on my web server. I found this solution and it worked fine:

$html = mb_convert_encoding($html, 'HTML-ENTITIES', "UTF-8");
$dom = new DomDocument();
$dom->loadHTML($html); 
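A related sketch, offered as an alternative rather than as part of the answer above: the 'HTML-ENTITIES' target of mb_convert_encoding is deprecated as of PHP 8.2, and a commonly used workaround is to prepend an XML encoding declaration so that libxml reads the input as UTF-8. The function name extractCellText and the sample markup are hypothetical; the XPath query is the one from the question.

```php
<?php
// Hypothetical helper: returns the text of every <td> in a UTF-8 HTML fragment.
function extractCellText(string $html): array
{
    $dom = new DOMDocument();

    // Collect libxml parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    // The prepended declaration hints to libxml that the bytes are UTF-8,
    // instead of letting the HTML parser fall back to ISO-8859-1.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $texts = [];
    foreach ($xpath->query('//td/text()') as $node) {
        // DOM returns node values as UTF-8 strings.
        $texts[] = $node->nodeValue;
    }
    return $texts;
}

// Sample input, assumed to be UTF-8 encoded like the POST data in the question.
foreach (extractCellText('<table><tr><td>café 日本語</td></tr></table>') as $text) {
    echo $text, "\n";
}
```

This assumes the input really is UTF-8; if the source page uses another encoding, it would need converting first.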
Lucia answered Sep 22 '22 16:09


A bit late in the game, but perhaps it helps someone...

The problem might be in the output, and not in the dom/xpath object itself.

If you output the nodeValue directly, you get corrupted characters, e.g.:

ìÂÂì ë¹Â디ì¤
ìì ë¹ë””ì¤ í°ì  íì¤

You have to create your DOM object with "utf-8" as the second parameter, new \DomDocument('1.0', 'utf-8'), but even then, printing a node list item's value still gives broken characters:

echo $contentItem->item($index)->nodeValue;

You have to wrap it in utf8_decode:

echo utf8_decode($contentItem->item($index)->nodeValue); // output: 者不終朝而會,愚者可浹旬而學
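Why this can work, as a hedged sketch: if the parser read the UTF-8 bytes as ISO-8859-1, each byte was re-encoded as its own character, so the node value is effectively encoded twice. Converting from UTF-8 back to ISO-8859-1 undoes one layer and restores the original bytes. Since utf8_decode is deprecated as of PHP 8.2, the sketch uses mb_convert_encoding for the same conversion. The helper name undoDoubleEncoding and the sample string are hypothetical, and the double encoding is simulated here rather than produced by DOMDocument.

```php
<?php
// Hypothetical helper: reverses one layer of "UTF-8 bytes read as ISO-8859-1".
function undoDoubleEncoding(string $value): string
{
    // Same conversion utf8_decode() performs: UTF-8 text to ISO-8859-1 bytes.
    return mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
}

$original = 'café';

// Simulate the corruption: treat each UTF-8 byte as an ISO-8859-1 character
// and encode it as UTF-8 again.
$doubleEncoded = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo $doubleEncoded, "\n";                     // garbled form
echo undoDoubleEncoding($doubleEncoded), "\n"; // restored form
```

If the value was never double-encoded, this conversion damages it instead, so it is only a workaround for input that was misread at load time.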

Kuko Kukanovic answered Sep 20 '22 16:09