How can I find the offset of a particular node or attribute using the PHP DOM extension (or another extension or library if necessary)?
For example, say I have this HTML document:
<html><a href="/foo">bar</a></html>
And using the following code (with appropriate modifications):
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
foreach($nodes as $href) {
    // Find start of $href attribute here
    echo $href->something;
}
I'd expect to see the output 15 or something to that effect, to indicate that the attribute starts at character 15 into the document.
There seems to be the method DOMNode::getLineNo(), which returns the line number. This is similar to what I want, but I can't find an equivalent for the general offset into the text.
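After finding the attribute you want, replace its value with a unique marker, serialize the document, and search the result for that marker: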
$html = <<<HTML
<html><a href="/foo">bar</a></html>
HTML;
$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//a/@href');
$mySecretId = 'abc123';
foreach($nodes as $href) {
    $href->value = $mySecretId;
}
$html = $dom->saveHTML();
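echo strpos($html, $mySecretId) . "\n";
strpos() returns the offset of the first occurrence of the marker value, which is where the attribute value starts in the serialized HTML. That offset refers to the output of saveHTML(), so it can differ from the offset in the original input if serialization changes the markup.
Note the flags LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD: they stop libxml from adding implied elements (such as body) and a default DOCTYPE, both of which would shift the offsets.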
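To see why the two flags matter, here is a minimal sketch that loads the same input with and without them and compares where the attribute value ends up in the serialized output. The variable names are only illustrative, and the exact wrapper markup that libxml adds may vary between versions.

```php
<?php
$source = '<html><a href="/foo">bar</a></html>';

// Without flags, libxml may add a DOCTYPE and implied elements such as <body>.
$plain = new DOMDocument;
$plain->loadHTML($source);

// With the flags, the markup is kept without those additions.
$flagged = new DOMDocument;
$flagged->loadHTML($source, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Offset of the attribute value in each serialized document.
$offsetPlain = strpos($plain->saveHTML(), '/foo');
$offsetFlagged = strpos($flagged->saveHTML(), '/foo');

echo "without flags: $offsetPlain\n";
echo "with flags:    $offsetFlagged\n";
```

The offset without the flags should come out larger, because the added DOCTYPE and wrapper elements precede the anchor tag.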
If you want to find all positions of the matched elements, do:
foreach($nodes as $href) {
    $previousValue = $href->value;
    $href->value = $mySecretId;
    $html = $dom->saveHTML();
    echo strpos($html, $mySecretId) . "\n";
    $href->value = $previousValue;
}
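The marker approach above could be wrapped in a small helper, sketched here under a few assumptions: the function name attributeOffsetsInSerializedHtml is hypothetical, the random marker is assumed not to occur anywhere else in the document, and the offsets are byte offsets into the saveHTML() output rather than into the original input string.

```php
<?php
// Hypothetical helper: returns the byte offset of each matched attribute value
// within the output of saveHTML() (not within the original input string).
function attributeOffsetsInSerializedHtml(DOMDocument $dom, string $query): array
{
    $xpath = new DOMXPath($dom);
    $offsets = [];
    foreach ($xpath->query($query) as $attr) {
        $original = $attr->value;
        // Random marker, assumed not to appear anywhere else in the document.
        $marker = 'marker' . bin2hex(random_bytes(8));
        $attr->value = $marker;
        $pos = strpos($dom->saveHTML(), $marker);
        $attr->value = $original; // restore before handling the next attribute
        if ($pos !== false) {
            $offsets[] = ['value' => $original, 'offset' => $pos];
        }
    }
    return $offsets;
}

$dom = new DOMDocument;
$dom->loadHTML(
    '<html><a href="/foo">bar</a><a href="/baz">qux</a></html>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
$result = attributeOffsetsInSerializedHtml($dom, '//a/@href');
$serialized = $dom->saveHTML();

foreach ($result as $entry) {
    echo $entry['value'] . ' starts at ' . $entry['offset'] . "\n";
}
```

Using a fresh random marker per attribute avoids a false match when the chosen marker text happens to appear in the document already.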
The following is based on some assumptions:
- a.href attributes are the only candidates that shall be handled - in case it shall be more, the regular expression pattern used might become (too) complicated
- a.href attributes are always encapsulated in double quotes " and the value of the attribute node must not be empty
- if a.href attributes occur multiple times in the very same node, the last occurrence takes precedence
preg_match_all with offset-capture
<?php
// define some HTML, could be retrieved by e.g. file_get_contents() as well
$html = <<< HTML
<!DOCTYPE html>
<html lang="en">
<body>
<a href="https://google.com/">Google</a><div><a href=
"https://stackoverflow.com/">StackOverflow</a></div>
<A HREF="https://google.com/" href="https://goo.gl/">
Google URL</a>
</body>
</html>
HTML;
// search href attributes in anchor tags (case insensitive & multi-line)
preg_match_all(
    '#<a[^>]*\s+href\s*=\s*"(?P<value>[^"]*)"[^>]*>#mis',
    $html,
    $matches,
    PREG_OFFSET_CAPTURE
);
$positions = array_map(
    function (array $match) {
        // PREG_OFFSET_CAPTURE offsets are in bytes, so measure the length in bytes too
        $length = strlen($match[0]);
        return [
            'value' => $match[0],
            'length' => $length,
            'start' => $match[1],
            'end' => $match[1] + $length,
        ];
    },
    $matches['value']
);
var_dump($positions);
This outputs position information like the following (note: the last entry uses the second href attribute, because href was defined twice for the same anchor tag):
array(3) {
[0] => array(4) {
'value' => string(19) "https://google.com/"
'length' => int(19)
'start' => int(49)
'end' => int(68)
}
[1] => array(4) {
'value' => string(26) "https://stackoverflow.com/"
'length' => int(26)
'start' => int(95)
'end' => int(121)
}
[2] => array(4) {
'value' => string(15) "https://goo.gl/"
'length' => int(15)
'start' => int(183)
'end' => int(198)
}
}
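One detail worth checking with this approach: PREG_OFFSET_CAPTURE reports offsets in bytes, not characters, so the two differ as soon as multibyte text precedes the match. The sketch below uses a made-up input string containing one two-byte UTF-8 character, reads the value back with substr(), and derives a character offset with mb_strlen() (this assumes the mbstring extension and UTF-8 input).

```php
<?php
// Illustrative input: "\u{e9}" is a two-byte UTF-8 character placed before the anchor.
$html = '<p>' . "\u{e9}" . '</p><a href="https://example.com/">x</a>';

preg_match_all(
    '#<a[^>]*\s+href\s*=\s*"(?P<value>[^"]*)"[^>]*>#is',
    $html,
    $matches,
    PREG_OFFSET_CAPTURE
);

$found = [];
foreach ($matches['value'] as [$value, $byteOffset]) {
    // Byte offsets pair with the byte-oriented substr() and strlen().
    $slice = substr($html, $byteOffset, strlen($value));
    // Character offset: count the characters in everything before the match.
    $charOffset = mb_strlen(substr($html, 0, $byteOffset), 'UTF-8');
    $found[] = ['slice' => $slice, 'bytes' => $byteOffset, 'chars' => $charOffset];
}

var_dump($found);
```

For pure ASCII input the byte and character offsets are identical, so the distinction only matters for documents with multibyte content.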