Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" URL-encoded: [@MERGEID] becomes %5B@MERGEID%5D.
Later in my code I need to replace [@MERGEID] with an ID, so I search for urlencode('[@MERGEID]'). However, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match: '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'.
Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get what I need to locate the merge variable in $message.
My question is: what RFC spec is DOMDocument using, and why is it different from urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?
Demo code:
$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));
$elements = $dom_document->getElementsByTagName('a');
foreach ($elements as $element) {
    $link = $element->getAttribute('href'); // http://www.google.com?ref=abc
    $tag = $element->getAttribute('data-tag'); // thebottomlink
    if ($link) {
        $newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
        if ($tag) {
            $newlink .= '&tag=' . $tag;
        }
        $element->setAttribute('href', $newlink);
    }
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge);
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
I believe that these two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while $element->setAttribute('href', $newlink), once serialized by saveHTML(), yields a complete URL that is still meant to be used as a URL.
For example:
urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com
This is convenient for encoding the query part, but the result cannot be used in <a href='...'>.
However:
$element->setAttribute('href', $newlink); // -> http://www.google.com
will properly encode the string so that it is still usable in href. It does not encode @ because it cannot tell whether the @ is part of the query or part of the userinfo or an email URL (for example: mailto:[email protected] or [email protected]).
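To make the difference concrete, here is a minimal sketch. It assumes the behaviour observed in the question: the serializer behind saveHTML() percent-encodes an href but leaves alphanumerics and common URL punctuation (including @, /, :, ?, = and &) untouched. The helper name hrefEscapeSketch() and its exact character list are assumptions for illustration only; they are not part of PHP or libxml, and other libxml versions may behave differently.

```php
<?php
// Hypothetical helper: approximates the href escaping seen in the question.
// Every byte outside the allowed set is replaced by %XX (uppercase hex).
function hrefEscapeSketch(string $url): string
{
    return preg_replace_callback(
        '/[^A-Za-z0-9\-_.!~*\'()@\/:=?;#%&,+]/',
        static fn(array $m): string => sprintf('%%%02X', ord($m[0])),
        $url
    );
}

$token = '[@MERGEID]';
echo urlencode($token), "\n";        // %5B%40MERGEID%5D
echo rawurlencode($token), "\n";     // %5B%40MERGEID%5D
echo hrefEscapeSketch($token), "\n"; // %5B@MERGEID%5D
```

Both urlencode() and rawurlencode() treat their input as a single URL component, so they encode @; the href-style escaping treats it as a whole URL, so it does not.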
Instead of using [@MERGEID], you can use @@MERGEID@@. Then you replace that with your ID later. This solution does not even require urlencode.
If you insist on using urlencode, you can just use %40 instead of @, so your code will look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;
You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
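Both suggestions can be sketched as follows. This assumes, as in the question, that saveHTML() leaves @ and existing %XX sequences in an href alone; the ID 12345 is a made-up value for illustration.

```php
<?php
$link = 'http://www.google.com?ref=abc';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress parser warnings, as in the question
$dom->loadHTML('<a href="' . $link . '">Google</a>');
$anchor = $dom->getElementsByTagName('a')->item(0);

// Option 1: a placeholder built only from characters that are not escaped.
$anchor->setAttribute('href', 'http://www.example.com/click/@@MERGEID@@?url=' . $link);
$withId = str_replace('@@MERGEID@@', '12345', $dom->saveHTML());

// Option 2: pre-encode the token, so the same urlencode() call finds it later.
$encoded = urlencode('[@MERGEID]');
$anchor->setAttribute('href', 'http://www.example.com/click/' . $encoded . '?url=' . $link);
$withId2 = str_replace($encoded, '12345', $dom->saveHTML());

echo $withId, "\n", $withId2, "\n";
```

In both cases the string that is searched for later is exactly the string that was written into the attribute, so no extra str_replace on %40 is needed.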
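urlencode and rawurlencode were originally based on RFC 1738. urlencode still follows the form-style convention in which a space becomes +, while rawurlencode has followed RFC 3986 since PHP 5.3. RFC 3986 has been the current standard for URIs since 2005.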
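On the other hand, the DOM extension uses UTF-8 encoding, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with texts in ISO-8859-1 encoding (note that both are deprecated as of PHP 8.2, where mb_convert_encoding() is the usual substitute), or Iconv for other encodings.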
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.
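A short sketch of what this means in PHP terms: it compares the two built-in encoders on a space and a tilde, and shows a non-ASCII character being percent-encoded byte by byte from its UTF-8 form. The sample strings are arbitrary.

```php
<?php
// urlencode() targets form-style query strings: space becomes "+", "~" is encoded.
echo urlencode('a b~'), "\n";      // a+b%7E

// rawurlencode() follows RFC 3986: space becomes "%20", "~" is left alone.
echo rawurlencode('a b~'), "\n";   // a%20b~

// A non-ASCII character is converted to UTF-8 bytes, then each byte is percent-encoded.
$eAcute = "\u{00E9}";              // "é" as UTF-8, bytes C3 A9
echo rawurlencode($eAcute), "\n";  // %C3%A9
```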
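Here is a function to decode URLs with the RFC 3986 reserved characters in mind.
<?php
function myUrlEncode($string) {
    // Percent-encoded forms of the RFC 3986 reserved characters, plus "%".
    $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
    $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
    // urldecode() already decodes single-encoded input; the str_replace() only
    // affects sequences that remain afterwards (for example double-encoded input).
    return str_replace($entities, $replacements, urldecode($string));
}
?>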
Update:
Since UTF-8 has been used to encode $message:
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))
use urldecode($message) when returning the URL without percent signs.
die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge);
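As a hedged illustration of what that call does to a link, the sketch below decodes a sample href. The trailing ref=a+b value is an assumption added to show one side effect: urldecode() also turns + into a space, whereas rawurldecode() leaves it alone.

```php
<?php
$href = 'http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=a+b';

$decoded = urldecode($href);       // %5B and %5D become [ and ], "+" becomes a space
$rawDecoded = rawurldecode($href); // %5B and %5D become [ and ], "+" is kept

echo $decoded, "\n";
echo $rawDecoded, "\n";
```

If the message can contain literal + characters in URLs, rawurldecode() may be the safer choice of the two.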
The root cause of your problem has been very well explained from a technical point of view.
In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.
By processing your input $message through a DomDocument object, you have moved to a higher level of abstraction. It is wrong to manipulate as a single plain string something that has been "promoted" to an HTML stream.
Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:
$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';
$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document
// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);
// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);
echo $dom->saveHTML();
As illustrated above, today we are trying to fix the problem when your token is inside a href, but one day we may want to search and replace the tag elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.
(An alternative option would be to not load a DomDocument until all low-level replacements are done, but I am guessing this is not practical.)
Complete proof of concept:
function searchAndReplace(DOMNode $node, $search, $replace) {
    if($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $input = $attribute->nodeValue;
            $output = str_replace($search, $replace, $input);
            $attribute->nodeValue = $output;
        }
    }
    if(!$node instanceof DOMElement) { // this test needs double-checking
        $input = $node->nodeValue;
        $output = str_replace($search, $replace, $input);
        $node->nodeValue = $output;
    }
    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            searchAndReplace($child, $search, $replace);
        }
    }
}
$token = '<>&;[@MERGEID]';
$message = '<a/>';
$dom = new DOMDocument();
$dom->loadHTML($message);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo@$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);
echo $dom->saveHTML();
searchAndReplace($dom, $token, '*replaced*');
echo $dom->saveHTML();
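Applied to the scenario from the question, the same idea might look like the sketch below. It assumes the DOMDocument object is still available when the ID becomes known, so the token can be replaced on the raw attribute value before serialization; the ID 12345 is a made-up value.

```php
<?php
$token = '[@MERGEID]';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress parser warnings, as in the question
$dom->loadHTML('<a href="http://www.google.com?ref=abc">Google</a>');

// First pass: rewrite each link and embed the unencoded token.
foreach ($dom->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', 'http://www.example.com/click/' . $token . '?url=' . $a->getAttribute('href'));
}

// Later, with the same document: getAttribute() returns the raw value,
// so a plain str_replace() on the token matches without any URL encoding.
foreach ($dom->getElementsByTagName('a') as $a) {
    $a->setAttribute('href', str_replace($token, '12345', $a->getAttribute('href')));
}

$result = $dom->saveHTML();
echo $result;
```

Because the replacement happens before saveHTML() is called, the escaped form of the token never has to be guessed.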