
Xpath with html5lib in PHP

I have this basic code that doesn't work. How can I use XPath with html5lib in PHP, or with HTML5 in any other way?

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url);

$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);

$elements = $xpath->query('//h1');
//$elements = $dom->getElementsByTagName('h1');

foreach ($elements as $element) {
    var_dump($element);
}

No elements are found. $xpath->query('.') returns the root element, so XPath in general seems to work, and $dom->getElementsByTagName('h1') works as well.

Znarkus asked Jan 09 '23 23:01


2 Answers

Use the disable_html_ns option.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5(array(
    'disable_html_ns' => true, // add `disable_html_ns` option
));
$dom = $html5->loadHTML($response);

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');

foreach ($elements as $element) {
    var_dump($element);
}


disable_html_ns (boolean): Prevents the parser from automatically assigning the HTML5 namespace to the DOM document. This is for non-namespace aware DOM tools.

sounisi5011 answered Jan 12 '23 12:01


So it looks like html5lib is setting us up with a default namespace.

$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);
$de = $dom->documentElement;
if ($de->isDefaultNamespace($de->namespaceURI)) {
    echo $de->namespaceURI . "\n";
}

This outputs:

http://www.w3.org/1999/xhtml

To query against namespaced nodes with XPath, you need to register the namespace and use its prefix in the query.

$xpath = new DOMXPath($dom);
$xpath->registerNamespace('n', $de->namespaceURI);

$elements = $xpath->query('//n:h1');
foreach ($elements as $element)
    echo $element->nodeValue;

This outputs PHP.
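
The same behaviour can be reproduced without html5lib or a network request. The following is a minimal sketch, assuming a made-up inline XHTML string parsed with PHP's built-in DOMDocument; the sample markup and variable names are illustrative and not taken from the answer.

```php
<?php
// Hypothetical inline document standing in for the Wikipedia page.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);

$xpath = new DOMXPath($dom);

// An unprefixed name test only matches elements that are in no namespace.
$plain = $xpath->query('//h1');

// Bind a prefix to the document's default namespace and use it in the query.
$xpath->registerNamespace('n', $dom->documentElement->namespaceURI);
$prefixed = $xpath->query('//n:h1');

echo $plain->length, "\n";                 // 0
echo $prefixed->length, "\n";              // 1
echo $prefixed->item(0)->nodeValue, "\n";  // PHP
```

The prefix n is arbitrary; only the namespace URI it is bound to has to match the one on the elements.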

Generally I find it tedious to prefix everything in XPath queries when there's a default namespace involved, so I just strip it.

$de = $dom->documentElement;
$de->removeAttributeNS($de->namespaceURI, ''); // drop the default namespace declaration
$dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns

After that you can use your original XPath query and it'll work just fine.

$xpath = new DOMXPath($dom); // fresh DOMXPath, since the document was reloaded
$elements = $xpath->query('//h1');
foreach ($elements as $element)
    echo $element->nodeValue;

This now outputs PHP as well.
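
If reloading the document is not desirable, XPath itself offers a way to ignore namespaces: match on local-name() instead of using a name test. This is a different technique from the one used in this answer, sketched here on a made-up inline XHTML string with PHP's built-in DOMDocument.

```php
<?php
// Hypothetical inline document; any namespaced HTML would behave the same way.
$xml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><h1>PHP</h1><p>text</p></body></html>';

$dom = new DOMDocument();
$dom->loadXML($xml);
$xpath = new DOMXPath($dom);

// local-name() yields the element name without its namespace, so no prefix is needed.
$elements = $xpath->query("//*[local-name() = 'h1']");

foreach ($elements as $element) {
    echo $element->nodeValue, "\n"; // PHP
}
```

The trade-off is that every step of the query becomes more verbose, and the predicate matches an h1 in any namespace, not only the HTML one.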

So the modified version of the example would be something like:


$url = 'http://en.wikipedia.org/wiki/PHP';
$response = GuzzleHttp\get($url)->getBody();
$html5 = new Masterminds\HTML5();
$dom = $html5->loadHTML($response);

$de = $dom->documentElement;
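if ($de->isDefaultNamespace($de->namespaceURI)) {
    $de->removeAttributeNS($de->namespaceURI, ''); // drop the default namespace declaration
    $dom->loadXML($dom->saveXML()); // reload the existing dom, now sans default ns
}

$xpath = new DOMXPath($dom);
$elements = $xpath->query('//h1');
foreach ($elements as $element) {
    var_dump($element);
}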


class DOMElement#11 (18) {
  public $tagName =>
  string(2) "h1"
  public $schemaTypeInfo =>
  public $nodeName =>
  string(2) "h1"
  public $nodeValue =>
  string(3) "PHP"
  public $textContent =>
  string(3) "PHP"
user3942918 answered Jan 12 '23 12:01
