
HTML5 History.pushState mangles URLs containing percent-encoded non-ASCII (Unicode) chars

In an OSS web app, we have JS code that performs an Ajax update (it uses jQuery, which is not relevant here). After the page update, a call is made to the HTML5 history interface History.pushState, in the following code:

var updateHistory = function(url) {
    var context = { state:1, rand:Math.random() };
    /* -----> before the problem call <------- */
    History.pushState( context, "Questions", url );
    /* -----> after the problem call <------- */
    setTimeout(function (){
        /* HACK: For some weird reason, sometimes something overrides the above pushState, so we re-apply it.
                 This might be caused by some other JS plugin.
                 The delay of 10 ms allows the other plugin to override the URL first.
        */
        History.replaceState( context, "Questions", url );
    }, 10);
};

[Please note: the full code segment is provided for context, the HACK part is not the issue of this question]

The app is internationalized (i18n) and uses URL-encoded Unicode segments in its URLs, so just before the marked problem call in the above code, the URL argument contains (as inspected in Firebug):

"/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/"

The encoded segment is UTF-8 in percent encoding. For completeness (it doesn't really matter), the URL in the browser window is:

http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/
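A quick way to confirm that the segment really is percent-encoded UTF-8 is to decode it in a JS console. This is only a sanity check; the variable names are illustrative and not part of the app's code:

```javascript
// The path segment exactly as it appears in the URL above.
const encodedSegment = "%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9";

// decodeURIComponent interprets the %XX escapes as UTF-8 bytes.
const decodedSegment = decodeURIComponent(encodedSegment);

// The seven Arabic letters, written with \u escapes to avoid copy/paste issues.
const expectedSegment = "\u0627\u0644\u0623\u0633\u0626\u0644\u0629";

console.log(decodedSegment === expectedSegment);
// Encoding the decoded text again gives back the original segment.
console.log(encodeURIComponent(decodedSegment) === encodedSegment);
```

Both comparisons hold, so the URL passed to pushState is well-formed before the call.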

Just after the call, the URL displayed in the browser window changes to:

http://<base-url>/%C3%98%C2%A7%C3%99%C2%84%C3%98%C2%A3%C3%98%C2%B3%C3%98%C2%A6%C3%99%C2%84%C3%98%C2%A9/scope:all/sort:activity-desc/page:1/

The URL encoded segment is just mojibake, the result of using the wrong encoding at some level. The correct URL would've been:

http://<base-url>/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/

This behavior has been tested on both FF and Chrome.

The history interface specs don't mention anything about encoded URLs, but I assume the default standard for URL formation (UTF-8, percent encoding, etc.) would apply when passing URLs to the interface's functions.

Any idea what's going on here?

Edit:

I wasn't paying attention to the uppercase H in History - this code is actually using the History.js wrapper for the history interface. I replaced it with a direct call to history.pushState (notice the lowercase h), bypassing the wrapper, and the code works as expected as far as I can tell. The issue with the original code still stands, so it seems to be a problem in the History.js library.

Basel Shishani asked Jun 17 '12 04:06


1 Answer

Update

As Doug S explains in the comments below, the latest version of History.js includes a fix for this behaviour. He also found that my solution caused double-encoding in browsers that require the hash fallback (such as IE 9 and below), so rather than applying the fix detailed below, I recommend simply downloading the latest version.

I've kept my original answer below, since it does explain what's going on in much more detail.


Basel found a resolution of sorts, but there's still some confusion about what's happening under the hood. This answer goes into detail about the problem and suggests a better fix. (You can skip straight to the fix if you want.)

The problem

First, open your browser's JS console and run this:

window.encodeURI(window.unescape('%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9'))

Does that look familiar? It should—that's what your URL is being mangled to. The problem lies in the implementation of History.unescapeString, specifically this line:

tmp = window.unescape(result);
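The mangling can be reproduced on the full path from the question. The sketch below is not History.js source; it only combines the unescape call shown above with the same re-encoding step as the console snippet, and the helper name mangleLikeUnescape is made up for illustration. It uses the global unescape (the same function as window.unescape in a browser) so that it also runs outside a browser:

```javascript
// Hypothetical helper: decode with the legacy unescape, then re-encode as a URI.
function mangleLikeUnescape(path) {
  // unescape turns every %XX into the single character with code 0xXX.
  const misdecoded = unescape(path);
  // encodeURI then writes each of those characters out again as UTF-8.
  return encodeURI(misdecoded);
}

const original =
  "/%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9/scope:all/sort:activity-desc/page:1/";
const mangled = mangleLikeUnescape(original);

console.log(mangled);

// For comparison, a UTF-8 aware round trip leaves the path unchanged.
console.log(encodeURI(decodeURI(original)) === original);
```

The printed path matches the mangled URL reported in the question, while the decodeURI/encodeURI round trip is lossless for this input.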

window.unescape is a DOM Level 0 function, which is to say a deprecated relic from the hoary days of Netscape 2. It follows the octet-based escaping rules of RFC 2396, according to which characters outside of the unreserved range (alphanumerics and a small set of punctuation symbols) are encoded as octets. When decoding, it turns each %XX escape into the single character whose code is that octet.

This works fine for the US-ASCII range, but in UTF-8 only the ASCII characters fit in a single byte; the vast majority of characters need two or more. Since URIs have no built-in way of declaring the character set being used, window.unescape just assumes each octet is a complete character and blithely mangles any multi-byte sequence.

In this example, the first letter in your URL is the Arabic letter alef (ا), represented in UTF-8 by two bytes: 0xD8 0xA7. window.unescape interprets these as two separate characters: U+00D8 (Ø, capital O with stroke) and U+00A7 (§, section sign).
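The difference is easy to see by listing code points for that single letter. This is a small illustrative sketch; codePoints is a hypothetical display helper, and the global unescape is used in place of window.unescape so the snippet also runs outside a browser:

```javascript
// %D8%A7 is the UTF-8 encoding of the Arabic letter alef (U+0627).
const escaped = "%D8%A7";

// Display helper: list the code points of a string as 4-digit hex.
const codePoints = (s) =>
  Array.from(s, (ch) => ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")).join(" ");

// Legacy unescape: one character per %XX escape.
console.log(codePoints(unescape(escaped)));           // 00D8 00A7
// UTF-8 aware decoding: the two bytes form a single character.
console.log(codePoints(decodeURIComponent(escaped))); // 0627
// Re-encoding the misread characters yields four bytes instead of two.
console.log(encodeURI(unescape(escaped)));            // %C3%98%C2%A7
```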

This is a known issue with History.js.

The fix

As noted above by the asker, the issue can be sidestepped by using the native implementation of the History API instead of the History.js wrapper, i.e. history.pushState instead of History.pushState.

This works for browsers that support the History API, but loses the benefit of having a polyfill for those that don't. Fortunately, there's a better fix. Open up the History.js source you're referencing and find this line (~1059 in my copy):

tmp = window.unescape(result);

Replace it with:

tmp = window.unescape(encodeURIComponent(result));

Or, if you're using the compressed source, replace a.unescape(c) with a.unescape(encodeURIComponent(c)).
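As a rough check of what the patched expression does to an already percent-encoded segment, the sketch below compares the two expressions on the segment from the question. It exercises only the single expression, not History.js itself, and the variable names are illustrative:

```javascript
const segment = "%D8%A7%D9%84%D8%A3%D8%B3%D8%A6%D9%84%D8%A9";

// Original expression: every %XX becomes a separate character.
const before = unescape(segment);

// Patched expression: encodeURIComponent first turns each "%" into "%25",
// so unescape only undoes that extra layer and returns the input unchanged.
const after = unescape(encodeURIComponent(segment));

console.log(before === segment); // false
console.log(after === segment);  // true
```

Because the patched expression hands back the percent-encoded segment untouched, the UTF-8 escapes are never misread as single-byte characters.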

To test this change, I ran the History.js HTML5 jQuery test suite on a local web server inside an Arabic-named directory. Before making the change, test 14 failed; after the change, all tests passed.

Credit

Though I found the problem and solution independently, Damien Antipa deserves credit for finding it first and making a pull request with the fix.

Jordan Gray answered Oct 22 '22 03:10