RFC 3986 states that new URI schemes should encode textual data as UTF-8 before percent-encoding it. However, this requirement does not apply to URI schemes defined before it.
Is it safe to assume that every multibyte, percent-encoded URI becomes a UTF-8 encoded string after being passed through urldecode()?
For example, suppose the contents of $_SERVER['REQUEST_URI'] are percent-encoded like this:
/b%C3%BCch/w%C3%B6rterb%C3%BCch
After I pass this string to urldecode(), I should have a multibyte string. But how do I know which encoding the string is in? In the above example it's UTF-8, but is it safe to always assume so?
If it's not safe to assume so, is there a way (other than mb_detect_encoding) to detect the encoding of the string? I've checked the request headers, and they don't seem to contain anything helpful.
When you need to write a program that performs string manipulation, must be very fast, and will certainly never need exotic characters, UTF-8 may not be the best choice. In every other situation, UTF-8 should be the standard. UTF-8 works well in almost all recent software, even on Windows.
Since URLs often contain characters outside the ASCII set, the URL has to be converted into a valid ASCII format. URL encoding replaces unsafe ASCII characters with a "%" followed by two hexadecimal digits. URLs cannot contain spaces. URL encoding normally replaces a space with a plus (+) sign or with %20.
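As a small illustration of the two space conventions mentioned above, here is a sketch using PHP's built-in rawurlencode() and urlencode(). The sample string is an assumption chosen for the example, and the non-ASCII letters are written as explicit UTF-8 byte escapes so the result does not depend on how the source file is saved.

```php
<?php
// "büch wörter" as UTF-8 bytes: ü is 0xC3 0xBC, ö is 0xC3 0xB6.
$word = "b\xC3\xBCch w\xC3\xB6rter";

// rawurlencode() follows the RFC 3986 convention: a space becomes %20.
$path = rawurlencode($word);

// urlencode() follows the form-encoding convention: a space becomes "+".
$query = urlencode($word);

echo $path, "\n";   // b%C3%BCch%20w%C3%B6rter
echo $query, "\n";  // b%C3%BCch+w%C3%B6rter
```

Each non-ASCII character turns into one percent-encoded triplet per byte, which is why the byte-level encoding chosen before percent-encoding matters.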
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
Thank you for all the comments and answers! I have done some digging myself after I posted the question and would like to write it down here as a reference. Please let me know if this answer is wrong.
Skip to the end to go directly to the conclusion.
From the JETTY Docs on International Characters and Character Encoding, from the section "International characters in URLs", I found these paragraphs:
Due to the lack of a standard, different browsers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of jetty (eg 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, jetty-4.1.x reverted to a default encoding of ISO-8859-1.
The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8.
The linked HTML 4.0 spec does indeed recommend that clients encode non-ASCII characters as UTF-8 before percent-encoding them, so this has been a W3C recommendation since HTML 4.0.
The example used on the page is this:
<A href="http://foo.org/Håkon">...</A>
While it later states that the same encoding should be applied to the fragment part, it doesn't say whether this also applies to the query string.
Firefox
As Pekka already mentioned, based on this link, Firefox was sending ISO-8859-1 encoded URIs as late as 2007. Reading the link, this seems to be the default behavior for Firefox < 3.0. I'm not sure whether this also applies to Firefox < 3.0 on Mac OS X, since the default encoding on the Mac is UTF-8.
I've tested Firefox 3.6.13 in Windows XP and Firefox 6 in both Windows 7 and Mac OS X. The Mac version sends everything in UTF-8, so it's nothing to worry about.
Firefox 3.6.13 and 6 on Windows encode query strings as ISO-8859-1 by default, but when you type a character that doesn't exist in ISO-8859-1 into the query string (α, for example), Firefox 3 switches the encoding of the entire query string to UTF-8. I'm pretty sure later versions behave the same way.
In the Firefox 3.6.13 and 6 builds on Windows that I tested, the path part of the URI is always encoded as UTF-8.
If you type this URL into Firefox 3.6/6 on Windows:
http://localhost/test/ü/ä/index.php?chär=ü
The query string gets encoded as ISO-8859-1, but the 'path' part gets encoded as UTF-8:
http://localhost//test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC
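To make the split visible from the server side, here is a minimal sketch that decodes the path and the query string of that request separately and checks each against UTF-8. It assumes the mbstring extension is available (for mb_check_encoding()); the variable names are illustrative only.

```php
<?php
// Path and query as sent by Firefox 3.6/6 on Windows in the example above.
$uri = '/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC';

// Split before decoding so a percent-encoded "?" cannot move the boundary.
[$rawPath, $rawQuery] = array_pad(explode('?', $uri, 2), 2, '');

$path  = rawurldecode($rawPath);  // contains bytes C3 BC and C3 A4
$query = urldecode($rawQuery);    // contains bytes E4 and FC

$pathIsUtf8  = mb_check_encoding($path, 'UTF-8');
$queryIsUtf8 = mb_check_encoding($query, 'UTF-8');

var_dump($pathIsUtf8);   // bool(true)
var_dump($queryIsUtf8);  // bool(false)
```

The query bytes fail the check because 0xE4 and 0xFC are single ISO-8859-1 bytes, not complete UTF-8 sequences.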
Also worth noting: according to this blog post, Firefox 3.0 converts the katakana character ア into an HTML numeric character reference (&#12450;) before percent-encoding it. When I tried this in Firefox 3.6.13, in both the query string and the path, the katakana character was encoded as UTF-8 correctly.
Opera
Opera 10.10 on Mac encodes the query string part of the URI into ISO-8859-1, even though the default encoding for Mac OS X is UTF-8. The 'path' part gets encoded into UTF-8, just like Firefox.
If you type the Greek letter α into the query string, it gets sent as a question mark.
The same behavior is exhibited by Opera 11.51 in Windows XP.
Safari
Safari 5.1 on Mac always sends everything as UTF-8. Safari 5.1 on Windows exhibits the same behavior.
Chrome
Version 13 on Windows encodes both query string and path as UTF-8. I don't have Chrome on Mac, but it seems safe to assume that Chrome always sends UTF-8, like Safari.
Internet Explorer
DISCLAIMER: I use IECollection to install multiple versions of IE on one machine, so this may not be IE's natural behavior (can anyone confirm this?).
IE 6, 7, and 8 on Windows XP encode the 'path' part of the URI as UTF-8 correctly. Umlauts and Greek letters typed into the query string do not get percent-encoded, though. A query string typed into the address bar seems to be sent as ISO-8859-1, and the Greek letter alpha 'α' in the query string gets transliterated into 'a'.
This is short and incomplete, and I cannot guarantee its correctness, but it seems that the most common encodings for URIs are ISO-8859-1 and UTF-8 (I have no idea what encodings East Asian users rely on, and it would be too exhausting for me to try to find out).
Since it is already a recommendation from HTML 4.0, I guess it's safe to assume the 'path' part of the URI is always encoded in UTF-8. Firefox 2.0 might still be around, so you must check if the encoding is ISO-8859-1 too. If it's not UTF-8 or ISO-8859-1, most likely it's a bad request.
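A rough sketch of that check, assuming the mbstring extension is loaded. The function name decode_path_to_utf8() is hypothetical. Note one caveat: every byte string is technically valid ISO-8859-1, so the "bad request" case cannot be detected by validation alone. The control-character test below is only a heuristic I chose for illustration, not something the browser tests above establish.

```php
<?php
/**
 * Hypothetical helper: decode a percent-encoded path and return it as UTF-8.
 * Returns null when the bytes look like neither UTF-8 nor ISO-8859-1 text.
 */
function decode_path_to_utf8(string $rawPath): ?string
{
    $decoded = rawurldecode($rawPath);

    // Preferred case: the client already sent UTF-8.
    if (mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Heuristic (assumption): control bytes 0x00-0x1F and 0x7F-0x9F are
    // unlikely in real ISO-8859-1 text, so treat them as a bad request.
    if (preg_match('/[\x00-\x1F\x7F-\x9F]/', $decoded) === 1) {
        return null;
    }

    // Fallback for old clients: reinterpret the bytes as ISO-8859-1.
    return mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
}

var_dump(decode_path_to_utf8('/b%C3%BCch') === decode_path_to_utf8('/b%FCch')); // bool(true)
```

Both spellings of "/büch" end up as the same UTF-8 string, so routing code after this point only has to deal with one encoding.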
It's theoretically impossible to correctly detect the encoding of a string (see here, and here). You can guess, but you may get the wrong result. So don't rely on encoding detection.
Safe Multibyte Routing
The safest way is just to choose one encoding (UTF-8 is the safest bet) for your entire application. Then you have to:
Also see this great answer from bobince.
After this, you shouldn't have any problems parsing the URI. If the URI is not encoded in UTF-8, then it's a bad request, and you can respond with a 404 or 400 page.
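A minimal sketch of that strict policy, assuming the mbstring extension and a front controller that routes on $_SERVER['REQUEST_URI']. The helper name utf8_path_or_null() is hypothetical, and the choice of 400 over 404 is just one of the two options mentioned above.

```php
<?php
/**
 * Hypothetical helper: return the decoded path if it is valid UTF-8,
 * otherwise null. The query string is ignored for routing purposes.
 */
function utf8_path_or_null(string $requestUri): ?string
{
    // Keep only the part before the first "?".
    $rawPath = explode('?', $requestUri, 2)[0];
    $path = rawurldecode($rawPath);

    // Validate rather than detect: the path either is UTF-8 or it is rejected.
    return mb_check_encoding($path, 'UTF-8') ? $path : null;
}

$path = utf8_path_or_null($_SERVER['REQUEST_URI'] ?? '/');
if ($path === null) {
    http_response_code(400);
    exit('Bad Request');
}
// From here on, $path can be routed as a UTF-8 string.
```

Validation sidesteps the guessing problem: the code never tries to work out which encoding was used, it only checks that the one encoding the application accepts was used.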