This is valid XPath in JavaScript:
id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]
And this is the same query rewritten as XPath that works in PHP with DOMXPath::query():
//*[@id="priceInfo"]//div[@class="standardProdPricingGroup"]//span[1]
My main concern is that there could be many such differences between the two, and I am having trouble identifying them.
The question could also be put a different way: since JavaScript accepts several valid XPath formats, how can they be normalized to work with PHP?
One of the updates below also mentions that the id() function is valid XPath if the document has a valid DTD that contains this definition. I have no control over the input DTD, so a solution that works without any specific DTD would be ideal.
Update:
I want to transform the first format into the second with an algorithm. My input is the first one, not the second one, and I can't change this.
As @Nison Maël pointed out, the second format is also valid JavaScript XPath, as presented here: http://jsbin.com/elatum/2/edit. Unfortunately, this just adds to the problem of JavaScript XPath "fragmentation".
@salathe pointed out that the valid JavaScript XPath query works fine in PHP if the input document has a valid DTD (@Dimitre Novatchev mentioned this in a comment, but I overlooked its importance). Unfortunately, I don't have control over the input DTD, so now I have to investigate a way to overcome this, or find a solution that works even without a valid DTD.
I just saw that Salathe has already answered the same, but taking your comment into account, and to stress this a bit more:
You do not need to specify any DTD. As long as you use the DOMDocument::loadHTML or DOMDocument::loadHTMLFile functions, the HTML id attribute is actually registered for the XPath id() function. With the demo HTML given in http://jsbin.com/elatum/2/edit, you even get a warning when you load the document:
Warning: DOMDocument::loadHTMLFile(): ID priceInfo already defined in ...
That is already a sign that this is a true ID attribute, because the parser complains about duplicates. A related code sample looks like this:
$xpath = 'id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]';

// Load the document through the HTML parser so that id attributes are registered as IDs
$doc = new DOMDocument();
$doc->loadHTMLFile(__DIR__ . '/../data/file-11796340.html');

$xp = new DOMXPath($doc);
$r = $xp->query($xpath);

echo $xpath, "\n";
echo $r ? $r->length : 0, ' elements found', "\n";

// query() returns false for an invalid expression
if (!$r) return;

foreach ($r as $node) {
    echo " - ", $node->nodeValue, "\n";
}
The output is:
id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]
1 elements found
- hello
In case you need more control, first run an XPath query to mark all HTML id attributes as ID for XPath:
$r = $xp->query("//*[@id]");
if ($r) foreach ($r as $node) {
    $node->setIdAttribute('id', true);
}
You can then use the same XPath with the id() function, no need to change it.
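To see the effect of marking the attributes, here is a minimal sketch that loads a document with DOMDocument::loadXML instead, so that neither a DTD nor the HTML parser declares id as an ID attribute. The inline markup and the variable names ($xml, $expr, $before, $after) are made up for illustration and are not taken from the demo page. The same id() query is run once before and once after calling setIdAttribute:

```php
<?php
// Hypothetical sample markup; loaded as XML, so no DTD and no HTML parser
// registers the "id" attribute as an ID.
$xml = '<root>'
     . '<div id="priceInfo">'
     . '<div class="standardProdPricingGroup"><span>hello</span></div>'
     . '</div>'
     . '</root>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

$expr = 'id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]';

// Number of matches while "id" is still an ordinary attribute
$before = $xp->query($expr)->length;

// Mark every id attribute as a true ID, as described above
foreach ($xp->query('//*[@id]') as $node) {
    $node->setIdAttribute('id', true);
}

// Number of matches for the unchanged expression afterwards
$after = $xp->query($expr)->length;

echo "before: $before, after: $after\n";
```

If the ID registration behaves as described, the first count should be zero and the second should pick up the span, without any change to the XPath expression itself.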