Is it safe to assume decoded percent-encoded URIs turn into UTF-8?

RFC 3986 states that new URI schemes should encode text as UTF-8 before percent-encoding it. However, this does not apply to URI schemes that predate the RFC.

Is it safe to assume that every multibyte, percent-encoded URI turns into a UTF-8 encoded string after being passed through urldecode()?

For example, suppose $_SERVER['REQUEST_URI'] contains this percent-encoded value:

/b%C3%BCch/w%C3%B6rterb%C3%BCch

After I pass this string to urldecode(), I should have a multibyte string. But how do I know in what encoding the string is? In the above example, it's UTF-8, but is it safe to always assume so?

If it's not safe to assume so, is there a way (other than mb_detect_encoding) to detect the encoding of the string? I've checked request headers, they don't seem to have anything helpful.

asked Oct 10 '11 19:10

rickchristie

1 Answer

Thank you for all the comments and answers! I have done some digging myself after I posted the question and would like to write it down here as a reference. Please let me know if this answer is wrong.

Skip to the end to go directly to the conclusion.

From the JETTY Docs on International Characters and Character Encoding, from the section "International characters in URLs", I found these paragraphs:

Due to the lack of a standard, different browsers took different approaches to the character encoding used. Some use the encoding of the page and some use UTF-8. Some drafts were prepared by various standards bodies suggesting that UTF-8 would become the standard encoding. Older versions of jetty (eg 4.0.x series) used UTF-8 as the default in anticipation of a standard being adopted. As a standard was not forthcoming, jetty-4.1.x reverted to a default encoding of ISO-8859-1.

The W3C organization's HTML standard now recommends the use of UTF-8: http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars and accordingly jetty-6 series uses a default of UTF-8.

On the linked HTML 4.0 spec, there is indeed a recommendation for clients to encode non-ASCII characters into UTF-8 first before percent-encoding it, so we know it has been a recommendation from W3C since HTML 4.0.

The example used on the page is this:

<A href="http://foo.org/Håkon">...</A>

While it later states that the same encoding should be applied to the fragment part, it doesn't say whether this also applies to the query string.
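As a rough illustration of that recommendation, the sketch below percent-encodes the name from the example once from its UTF-8 bytes and once from its ISO-8859-1 bytes. It assumes PHP 7+ with the mbstring extension; the variable names are only for illustration.

```php
<?php
// Assumption: text is held as UTF-8 internally; "\u{E5}" is the letter å.
$name = "H\u{E5}kon";

// HTML 4.0 recommendation: percent-encode the UTF-8 bytes.
$utf8Encoded = rawurlencode($name);

// What a client using ISO-8859-1 would send instead.
$latin1Encoded = rawurlencode(mb_convert_encoding($name, 'ISO-8859-1', 'UTF-8'));

echo $utf8Encoded, "\n";   // H%C3%A5kon
echo $latin1Encoded, "\n"; // H%E5kon
```

The same character therefore arrives as two bytes (%C3%A5) or one byte (%E5) depending on which encoding the client picked.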

Typing URLs into browsers

Firefox

As Pekka already mentioned, based on this link Firefox was still sending ISO-8859-1 encoded URIs as late as 2007. Reading the link, this seems to be the default behavior for Firefox < 3.0. I'm not sure if this also applies to Firefox < 3.0 on Mac OS X, since the default encoding on Mac is UTF-8.

I've tested Firefox 3.6.13 in Windows XP and Firefox 6 in both Windows 7 and Mac OS X. The Mac version sends everything in UTF-8, so it's nothing to worry about.

Firefox 3.6.13 and 6 on Windows encode query strings as ISO-8859-1 by default, but when you type characters that don't exist in ISO-8859-1 into the query string (α, for example), Firefox 3 switches the encoding of the entire query string to UTF-8. I'm pretty sure this is the same behavior in later versions too.

In the Windows builds of Firefox 3.6.13 and 6 that I tested, the path part of the URI is always encoded as UTF-8.

If you type this URL into Firefox 3.6/6 on Windows:

http://localhost/test/ü/ä/index.php?chär=ü

The query string gets encoded as ISO-8859-1, but the 'path' part gets encoded as UTF-8:

http://localhost/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC
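To see the mixed encoding from the server side, here is a minimal sketch that decodes the path and query of that request separately and checks each against UTF-8. It assumes the mbstring extension is available; the variable names are made up for this example.

```php
<?php
// Path and query of the request above, as the server would receive them.
$requestUri = '/test/%C3%BC/%C3%A4/index.php?ch%E4r=%FC';

// Split before decoding so that an encoded "?" cannot confuse the split.
[$rawPath, $rawQuery] = array_pad(explode('?', $requestUri, 2), 2, '');

$path  = rawurldecode($rawPath);
$query = urldecode($rawQuery);

var_dump(mb_check_encoding($path, 'UTF-8'));  // bool(true)
var_dump(mb_check_encoding($query, 'UTF-8')); // bool(false)
```

The decoded query contains the single bytes 0xE4 and 0xFC, which are not valid UTF-8 on their own, so the check fails for the query but passes for the path.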

Also to be noted, according to this blog post, Firefox 3.0 converts the katakana character ア into ア before percent-encoding it. When I tried this in Firefox 3.6.13, in both the query string and the path, the katakana character was encoded in UTF-8 correctly.

Opera

Opera 10.10 on Mac encodes the query string part of the URI as ISO-8859-1, even though the default encoding for Mac OS X is UTF-8. The 'path' part gets encoded as UTF-8, just like in Firefox.

If you type the Greek letter α into the query string, it gets sent as a question mark.

Opera 11.51 on Windows XP exhibits the same behavior.

Safari

Safari 5.1 on Mac always sends everything as UTF-8. Safari 5.1 on Windows exhibits the same behavior.

Chrome

Chrome 13 on Windows encodes both the query string and the path as UTF-8. I don't have Chrome on Mac, but it seems safe to assume that Chrome always sends UTF-8, like Safari.

Internet Explorer

DISCLAIMER: I use IECollection to install multiple versions of IE on one machine, so this may not be IE's natural behavior (can anyone confirm this?).

IE 6, 7, and 8 on Windows XP encode the 'path' part of the URI as UTF-8 correctly. Umlauts and Greek letters typed into the query string do not get percent-encoded, though. The query string typed into the address bar seems to be sent in ISO-8859-1, and the Greek letter alpha 'α' in the query string gets transliterated into 'a'.

Conclusion

This is short and incomplete, and I cannot guarantee its correctness, but it seems that the most common encodings for URIs are ISO-8859-1 and UTF-8 (I have no idea what encodings are common in East Asia, and it would be too exhausting for me to try to find out).

Since it has been a recommendation since HTML 4.0, I guess it's safe to assume the 'path' part of the URI is always encoded in UTF-8. Firefox 2.0 might still be around, so you must check whether the encoding is ISO-8859-1 too. If it's not UTF-8 or ISO-8859-1, most likely it's a bad request.

It's theoretically impossible to correctly detect the encoding of a string (see here, and here). You can guess, but you can get the wrong result. So don't rely on encoding detection.
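If you still want a fallback for ISO-8859-1 clients, one possible sketch is to validate the decoded string as UTF-8 and only otherwise treat it as ISO-8859-1. This is a guess, not detection: every byte sequence is technically valid ISO-8859-1. The function name toUtf8OrNull() is hypothetical, and rejecting the C1 control range (0x80-0x9F) is an assumption of this sketch, not something the specs require.

```php
<?php
/**
 * Hypothetical helper: return $decoded as UTF-8, or null when it is neither
 * valid UTF-8 nor plausible ISO-8859-1 under this sketch's rules.
 */
function toUtf8OrNull(string $decoded): ?string
{
    // Valid UTF-8 (including pure ASCII) is passed through unchanged.
    if (mb_check_encoding($decoded, 'UTF-8')) {
        return $decoded;
    }

    // Assumption: bytes 0x80-0x9F are C1 control codes in ISO-8859-1 and
    // are unlikely in a typed URL, so treat them as a bad request.
    if (preg_match('/[\x80-\x9F]/', $decoded) === 1) {
        return null;
    }

    // Fallback guess: interpret the bytes as ISO-8859-1 and convert.
    return mb_convert_encoding($decoded, 'UTF-8', 'ISO-8859-1');
}

var_dump(toUtf8OrNull(urldecode('ch%E4r')) === "ch\u{E4}r"); // bool(true)
```

Note that a short ISO-8859-1 string can, in rare cases, also be valid UTF-8, in which case this sketch silently picks the UTF-8 reading.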

Safe Multibyte Routing

The safest way is just to choose one encoding (UTF-8 is the safest bet) for your entire application. Then you have to:

Make sure that all your strings are encoded in UTF-8 before using them to build your URI. Properly percent-encode your URI after that.
Make sure all your URL-encoded (GET) forms send their data in the proper encoding. See this FAQ by Kore Nordmann for more information about making sure your forms send the correct encoding.
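The first point could look roughly like the sketch below, which refuses non-UTF-8 path segments and percent-encodes each segment separately. buildUri() is a hypothetical helper written for this answer, not part of PHP; rawurlencode(), http_build_query() and mb_check_encoding() are the built-in functions it relies on.

```php
<?php
/**
 * Hypothetical helper: build a relative URI from UTF-8 path segments and
 * query parameters. Only the path segments are validated in this sketch.
 */
function buildUri(array $segments, array $query = []): string
{
    foreach ($segments as $segment) {
        if (!mb_check_encoding($segment, 'UTF-8')) {
            throw new InvalidArgumentException('Path segment is not valid UTF-8');
        }
    }

    // Encode each segment on its own so that "/" inside a segment is escaped.
    $path = '/' . implode('/', array_map('rawurlencode', $segments));

    if ($query === []) {
        return $path;
    }

    // PHP_QUERY_RFC3986 encodes spaces as %20 instead of "+".
    return $path . '?' . http_build_query($query, '', '&', PHP_QUERY_RFC3986);
}

echo buildUri(["b\u{FC}ch", "w\u{F6}rterb\u{FC}ch"]), "\n";
// /b%C3%BCch/w%C3%B6rterb%C3%BCch
```

Encoding segment by segment, rather than the whole path at once, keeps the separators intact while still escaping everything inside each segment.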

Also see this great answer from bobince.

After this, you shouldn't have any problems parsing the URI. If the URI is not encoded in UTF-8, then it's a bad request, and you can respond with a 404 or 400 page.
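A minimal sketch of that last step, assuming a front controller that reads $_SERVER['REQUEST_URI'] and that the mbstring extension is loaded. decodeUtf8Path() is a hypothetical name, and the choice of 400 over 404 here is only one of the two options mentioned above.

```php
<?php
/**
 * Hypothetical helper: decode the path part of a request URI and return it,
 * or return null when the decoded bytes are not valid UTF-8.
 */
function decodeUtf8Path(string $requestUri): ?string
{
    // Drop the query string before decoding the path.
    $rawPath = explode('?', $requestUri, 2)[0];
    $path = rawurldecode($rawPath);

    return mb_check_encoding($path, 'UTF-8') ? $path : null;
}

// Fall back to "/" when run outside a web request (e.g. from the CLI).
$path = decodeUtf8Path($_SERVER['REQUEST_URI'] ?? '/');

if ($path === null) {
    http_response_code(400);
    exit('Bad Request');
}

// From here on, $path can be routed as a UTF-8 string.
```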

answered Sep 27 '22 22:09

rickchristie


Tags: http, php, uri
