Xpath with html5lib in PHP

I have this basic code that doesn't work. How can I use XPath with html5lib in PHP, or with HTML5 in some other way?

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);

$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');

foreach ($elements as $element)
{
    var_dump($element);
}

No elements are found. $xpath->query('.') does return the root element, so XPath in general seems to work, and $dom->getElementsByTagName('h1') finds the heading.

Znarkus asked Jan 09 '23 23:01


2 Answers

Use the disable_html_ns option when constructing the parser.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
    'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');

foreach ($elements as $element) {
    var_dump($element);
}

https://github.com/Masterminds/html5-php#options

disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.

sounisi5011 answered Jan 12 '23 12:01


It looks like html5lib puts the parsed document in a default namespace.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
    echo $de->namespaceURI . "\n";
}

This outputs:

 http://www.w3.org/1999/xhtml

To query namespaced nodes with XPath, you need to register the namespace and use its prefix in the query.

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);

$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
{
    echo $element->nodeValue;
}

This outputs PHP.
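
If you want to see the effect without html5lib or a network call, here is a minimal sketch using only PHP's built-in DOM extension. The inline XHTML string is a stand-in I made up for the parsed Wikipedia page, and the prefix n is arbitrary. It shows that an unprefixed name only matches elements in no namespace, while the prefixed query finds the heading.

```php
<?php
// Stand-in for the parsed page: a small document in the XHTML namespace,
// built with plain ext-dom so the sketch runs without html5lib.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// In XPath 1.0 an unprefixed name matches only elements in no namespace.
$plain = $xpath->query('//h1');

// Bind a prefix (the name "n" is arbitrary) to the document's namespace URI.
$xpath->registerNamespace('n', $dom->documentElement->namespaceURI);
$prefixed = $xpath->query('//n:h1');

echo "unprefixed matches: {$plain->length}\n";
echo "prefixed matches: {$prefixed->length}\n";
echo $prefixed->item(0)->nodeValue, "\n";
```

The unprefixed query returns an empty node list and the prefixed one returns the single h1, which is the same behaviour you get from the html5lib document.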


Generally I find it tedious to prefix everything in xpath queries when there's a default namespace involved, so I just strip it.

$de = $dom->documentElement;
$de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns

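After that you can use your original xpath and it'll work just fine, as long as the DOMXPath object is created after the reload.

$xpath = new DOMXPath($dom); // create it after loadXML() so it sees the reloaded document
$elements = $xpath->query('//h1');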
foreach ($elements as $element)
{
    echo $element->nodeValue;
}

This now outputs PHP as well.


So the modified version of the example would be something like:

Example:

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
    $de->removeAttributeNS($de->getAttributeNode("xmlns")->nodeValue,"");
    $dom->loadXML($dom->saveXML());
}

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element)
{
    var_dump($element);
}

Output:

class DOMElement#11 (18) {
  public $tagName =>
  string(2) "h1"
  public $schemaTypeInfo =>
  NULL
  public $nodeName =>
  string(2) "h1"
  public $nodeValue =>
  string(3) "PHP"
  ...
  public $textContent =>
  string(3) "PHP"
}
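
If you would rather neither register a prefix nor reload the document, another option (not something the library documents, just standard XPath 1.0) is to match on local-name(), which ignores the namespace entirely. A minimal sketch, again using a made-up inline XHTML string in place of the fetched page:

```php
<?php
// Stand-in for the parsed page: elements live in the XHTML default namespace.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1><p>text</p></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// local-name() returns the element name without any namespace part,
// so this matches h1 regardless of the namespace it is in.
$headings = $xpath->query("//*[local-name() = 'h1']");

foreach ($headings as $heading) {
    echo $heading->nodeValue, "\n";
}
```

The trade-off is that the queries get more verbose and will also match an h1 from any other namespace, so it suits one-off lookups better than a large set of queries.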

user3942918 answered Jan 12 '23 12:01