
DOMDocument->saveHTML() vs urlencode with commercial at symbol (@)

Using DOMDocument(), I'm replacing links in a $message and adding some things, like [@MERGEID]. When I save the changes with $dom_document->saveHTML(), the links get "sort of" url-encoded. [@MERGEID] becomes %5B@MERGEID%5D.

Later in my code I need to replace [@MERGEID] with an ID. So I search for urlencode('[@MERGEID]') - however, urlencode() changes the commercial at symbol (@) to %40, while saveHTML() has left it alone. So there is no match - '%5B@MERGEID%5D' != '%5B%40MERGEID%5D'

Now, I know I can run str_replace('%40', '@', urlencode('[@MERGEID]')) to get the string I need to locate the merge variable in $message.

My question is, what RFC spec is DOMDocument using, and why is it different than urlencode or even rawurlencode? Is there anything I can do about that to save a str_replace?

Demo code:

$message = '<a href="http://www.google.com?ref=abc" data-tag="thebottomlink">Google</a>';
$dom_document = new \DOMDocument();
libxml_use_internal_errors(true); // Suppress content errors
$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'));       
$elements = $dom_document->getElementsByTagName('a');
foreach($elements as $element) {    
    $link = $element->getAttribute('href'); //http://www.google.com?ref=abc
    $tag = $element->getAttribute('data-tag'); //thebottomlink
    if ($link) {
        $newlink = 'http://www.example.com/click/[@MERGEID]?url=' . $link;
        if ($tag) {
            $newlink .= '&tag=' . $tag;
        } 
        $element->setAttribute('href', $newlink);
    }
}
$message = $dom_document->saveHTML();
$urlencodedmerge = urlencode('[@MERGEID]');
die($message . ' and url encoded version: ' . $urlencodedmerge); 
//<a data-tag="thebottomlink" href="http://www.example.com/click/%5B@MERGEID%5D?url=http://www.google.com?ref=abc&amp;tag=thebottomlink">Google</a> and url encoded version: %5B%40MERGEID%5D
Luke Shaheen asked Dec 04 '14 19:12


3 Answers

I believe these two encodings serve different purposes. urlencode() encodes "a string to be used in a query part of a URL", while the value you set with $element->setAttribute('href', $newlink); is treated as a complete URL and is escaped as such when the document is serialized by saveHTML().

For example:

urlencode('http://www.google.com'); // -> http%3A%2F%2Fwww.google.com

This is convenient for encoding the query part, but it cannot be used on <a href='...'>.

However:

$element->setAttribute('href', $newlink); // -> http://www.google.com

will leave the string usable in href once the document is saved. The serializer cannot encode @ because it cannot tell whether @ is part of the query, part of the userinfo, or part of an email address (for example, in a mailto: link or in the user@host portion of a URL).
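The mismatch described in the question can be reproduced without any DOM code. The sketch below compares urlencode() and rawurlencode() on the placeholder and applies the question's str_replace workaround. The value in $seenInHtml is copied from the question's output rather than computed here.

```php
<?php
// The placeholder from the question.
$token = '[@MERGEID]';

$formEncoded = urlencode($token);    // form-style encoding
$rawEncoded  = rawurlencode($token); // RFC 3986 encoding

// What saveHTML() produced in the question's output (copied, not computed).
$seenInHtml = '%5B@MERGEID%5D';

// The question's workaround: put the literal @ back after encoding.
$needle = str_replace('%40', '@', $formEncoded);

var_dump($formEncoded, $rawEncoded, $needle === $seenInHtml);
```

Both PHP functions encode @ as %40, so neither matches the serialized attribute without the extra replacement.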


Solution

  1. Instead of using [@MERGEID], you can use @@MERGEID@@. Then, you replace that with your ID later. This solution does not require you to even use urlencode.

  2. If you insist on using urlencode, you can write %40 instead of @, so that the saved markup matches urlencode('[@MERGEID]'). Your code would then look like this: $newlink = 'http://www.example.com/click/[%40MERGEID]?url=' . $link;

  3. You can also do something like $newlink = 'http://www.example.com/click/' . urlencode('[@MERGEID]') . '?url=' . $link;
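A minimal sketch of option 1, assuming the @@MERGEID@@ placeholder survives saveHTML() unchanged (the question's output shows @ left alone, and the placeholder contains no brackets). The ID 12345 and the sample link are made up for illustration.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$dom = new DOMDocument();
$dom->loadHTML('<a href="http://www.google.com?ref=abc">Google</a>');

foreach ($dom->getElementsByTagName('a') as $a) {
    // Prefix the original link with the tracking URL and the placeholder.
    $newlink = 'http://www.example.com/click/@@MERGEID@@?url=' . $a->getAttribute('href');
    $a->setAttribute('href', $newlink);
}

$html = $dom->saveHTML();

// Later: swap the placeholder for a real ID (12345 is a hypothetical value).
$html = str_replace('@@MERGEID@@', '12345', $html);
echo $html;
```

Because the placeholder contains nothing that gets percent-escaped, the later search needs neither urlencode() nor the %40 fix-up.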

invisal answered Nov 02 '22 11:11


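urlencode() and rawurlencode() both have their roots in RFC 1738. urlencode() produces application/x-www-form-urlencoded output (spaces become +), while rawurlencode() follows RFC 3986, which has been the current standard for URIs since 2005.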

The DOM extension, on the other hand, works in UTF-8, which is defined in RFC 3629. Use utf8_encode() and utf8_decode() to work with text in ISO-8859-1 (both are deprecated as of PHP 8.2, where mb_convert_encoding() is the usual replacement), or iconv for other encodings.

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values.
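As a small illustration of that rule, here is how rawurlencode() treats unreserved characters, a non-ASCII character supplied as UTF-8 bytes, and the reserved characters from the question. The sample strings are chosen only for the demonstration.

```php
<?php
// Unreserved characters (letters, digits, - . _ ~) pass through untouched.
$encodedUnreserved = rawurlencode('AZaz09-._~');

// "é" given explicitly as its two UTF-8 bytes; each byte is percent-encoded.
$encodedEAcute = rawurlencode("\xC3\xA9");

// Reserved characters such as [ @ ] are percent-encoded as well.
$encodedReserved = rawurlencode('[@]');

printf("%s\n%s\n%s\n", $encodedUnreserved, $encodedEAcute, $encodedReserved);
```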

Here is a function to encode URLs according to RFC 3986. It runs urlencode() and then restores the reserved characters.

<?php
    // Percent-encode $string, then turn the reserved characters back into literals.
    function myUrlEncode($string) {
       $entities = array('%21', '%2A', '%27', '%28', '%29', '%3B', '%3A', '%40', '%26', '%3D', '%2B', '%24', '%2C', '%2F', '%3F', '%25', '%23', '%5B', '%5D');
       $replacements = array('!', '*', "'", "(", ")", ";", ":", "@", "&", "=", "+", "$", ",", "/", "?", "%", "#", "[", "]");
       return str_replace($entities, $replacements, urlencode($string));
    }
?>



Update:

Since UTF8 has been used to encode $message:

$dom_document->loadHTML(mb_convert_encoding($message, 'HTML-ENTITIES', 'UTF-8'))

To return the URL without percent-escapes, pass $message through urldecode():

die(urldecode($message) . ' and url encoded version: ' . $urlencodedmerge); 
carlodurso answered Nov 02 '22 11:11


The root cause of your problem has been very well explained from a technical point of view.

In my opinion, however, there is a conceptual flaw in your approach, and it created the situation that you are now trying to fix.

By processing your input $message through a DomDocument object, you have moved to a higher level of abstraction. It is wrong to keep manipulating as a single plain string something that has been "promoted" to an HTML document.

Instead of trying to reproduce DomDocument's behaviour, use the library itself to locate, extract and replace the values of interest:

$token = 'blah blah [@MERGEID]';
$message = '<a id="' . $token . '" href="' . $token . '"></a>';

$dom = new DOMDocument();
$dom->loadHTML($message);
echo $dom->saveHTML(); // now we have an abstract HTML document

// extract a raw value
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('href');
// do the low-level fiddling
$newstring = str_replace($token, 'replaced', $rawstring);
// push the new value back into the abstract black box.
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', $newstring);

// less code written, but works all the time
$rawstring = $dom->getElementsByTagName('a')->item(0)->getAttribute('id');
$newstring = str_replace($token, 'replaced', $rawstring);
$dom->getElementsByTagName('a')->item(0)->setAttribute('id', $newstring);

echo $dom->saveHTML();

As illustrated above, today we are trying to fix the problem when your token is inside an href, but one day we may want to search for and replace the token elsewhere in the document. To account for this case, do not bother making your low-level code HTML-aware.

(an alternative option would be not loading a DomDocument until all low-level replacements are done, but I am guessing this is not practical)
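One way to find the token wherever it occurs in an attribute is to let DOMXPath do the searching. This is only a sketch of the idea, not part of the answer's own code: the sample markup, the ID 12345 and the variable names are assumptions made for illustration.

```php
<?php
libxml_use_internal_errors(true); // keep parser warnings out of the output

$token = '[@MERGEID]';
$html  = '<a data-tag="t-[@MERGEID]" href="http://www.example.com/click/[@MERGEID]?url=http://www.google.com">Google</a>';

$dom = new DOMDocument();
$dom->loadHTML($html);

// Select every attribute, on any element, whose value contains the token.
$xpath = new DOMXPath($dom);
$attributes = $xpath->query('//@*[contains(., "' . $token . '")]');

foreach ($attributes as $attr) {
    // The DOM holds the raw value, so no percent-encoding is involved here.
    $attr->value = str_replace($token, '12345', $attr->value);
}

$result = $dom->saveHTML();
echo $result;
```

The replacement happens on the unescaped attribute values, so the search string never has to imitate whatever escaping saveHTML() applies afterwards.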


Complete proof of concept:

function searchAndReplace(DOMNode $node, $search, $replace) {
    if($node->hasAttributes()) {
        foreach ($node->attributes as $attribute) {
            $input = $attribute->nodeValue;
            $output = str_replace($search, $replace, $input);
            $attribute->nodeValue = $output;
        }
    }

    if ($node instanceof DOMCharacterData) { // text, comment and CDATA nodes carry their own value
        $input = $node->nodeValue;
        $output = str_replace($search, $replace, $input);
        $node->nodeValue = $output;
    }

    if($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            searchAndReplace($child, $search, $replace);
        }
    }
}

$token = '<>&;[@MERGEID]';
$message = '<a/>';

$dom = new DOMDocument();
$dom->loadHTML($message);

$dom->getElementsByTagName('a')->item(0)->setAttribute('id', "foo$token");
$dom->getElementsByTagName('a')->item(0)->setAttribute('href', "http://foo@$token");
$textNode = new DOMText("foo$token");
$dom->getElementsByTagName('a')->item(0)->appendChild($textNode);

echo $dom->saveHTML();

searchAndReplace($dom, $token, '*replaced*');

echo $dom->saveHTML();
RandomSeed answered Nov 02 '22 11:11