
Transform JavaScript XPath into valid PHP query() XPath | normalize JS XPath --> PHP

This is valid XPath in Javascript:

id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]

And this is the same expression rewritten as XPath that works with PHP's DOMXPath->query():

//*[@id="priceInfo"]//div[@class="standardProdPricingGroup"]//span[1]

  1. Do you know of any libraries or custom components that already do this transformation?
  2. Do you know of any documentation that lists the differences between the two syntaxes?

My main concern is that there could be many differences, and I am having trouble identifying them.

The question could also be put a different way: since JavaScript accepts several valid XPath formats, how can they be normalized to work with PHP?

One of the updates also mentions that the id() function is valid XPath if the document has a valid DTD that contains this definition. I don't have control over the input DTD, so a solution that works without any specific DTD would be awesome.

Update:

I want to transform the first format into the second with an algorithm. My input is the first one, not the second one, and I can't change this.

As @Nison Maël pointed out, the 2nd format is also valid JavaScript XPath, as presented here: http://jsbin.com/elatum/2/edit. Unfortunately, this just adds to the problem of JavaScript XPath "fragmentation".

@salathe pointed out that the valid JavaScript XPath query works fine in PHP if the input document has a valid DTD (@Dimitre Novatchev mentioned this in a comment, but I overlooked its importance). Unfortunately, I don't have control over the input DTD, so I now have to investigate a way to overcome this, or find a solution that works even without a valid DTD.

Pentium10 asked Aug 03 '12 13:08


1 Answer

I just saw that salathe has already given the same answer, but to take your comment into account and stress the point a bit more:

You do not need to specify any DTD. As long as you load the document with DOMDocument::loadHTML or DOMDocument::loadHTMLFile, the HTML id attribute is registered for the XPath id() function. With the demo HTML given in http://jsbin.com/elatum/2/edit, you even get a warning when you load the document:

Warning: DOMDocument::loadHTMLFile(): ID priceInfo already defined in ...

This is already a sign that it is treated as a true ID attribute, because the parser complains about duplicates. Sample code:

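// The JavaScript-style expression is used unchanged.
$xpath = 'id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]';

$doc = new DOMDocument();
// loadHTMLFile() registers HTML id attributes as IDs, so id() works without a DTD.
$doc->loadHTMLFile(__DIR__ . '/../data/file-11796340.html');
$xp = new DOMXPath($doc);

// query() returns false when the expression is invalid.
$r = $xp->query($xpath);
echo $xpath, "\n";
echo $r ? $r->length : 0, ' elements found', "\n";
if (!$r) return;
foreach ($r as $node) {
    echo " - ", $node->nodeValue, "\n";
}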

The output is:

id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]
1 elements found
 - hello

If you need more control, first run an XPath query that marks every HTML id attribute as an ID for XPath:

$r = $xp->query("//*[@id]");
if ($r) foreach($r as $node) {
    $node->setIdAttribute('id', true);
}

You can then use the same XPath with the id() function; there is no need to change it.
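
To see the effect of setIdAttribute() in isolation, here is a minimal self-contained sketch. It assumes a made-up XML string (not the asker's real document) loaded with loadXML(), which, unlike loadHTML(), does not register id attributes by itself when no DTD is present. The markup and variable names are only for illustration.

```php
<?php
// Hypothetical sample markup; no DTD is declared, so nothing is an ID yet.
$xml = '<root><div id="priceInfo"><div class="standardProdPricingGroup">'
     . '<span>hello</span><span>world</span></div></div></root>';

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// The JavaScript-style expression, unchanged.
$expr = 'id("priceInfo")/div[@class="standardProdPricingGroup"]/span[1]';

// Without a DTD, id() has no registered IDs to look up.
$before = $xp->query($expr)->length;

// Flag every id attribute as an ID, as in the snippet above.
foreach ($xp->query('//*[@id]') as $node) {
    $node->setIdAttribute('id', true);
}

// Run the very same expression again.
$after = $xp->query($expr);
echo $before, ' before, ', $after->length, " after\n";
foreach ($after as $node) {
    echo ' - ', $node->nodeValue, "\n";
}
```

The same expression should match nothing on the first run and the first span on the second, so the input XPath itself never has to be rewritten.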

hakre answered Oct 24 '22 06:10