When I copy and paste the URL of this Wikipedia article, it looks like this:
http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29
However, if you paste this back into the browser's address bar, the percent signs disappear and what appear to be Unicode characters (and maybe special URL characters) take their place.
Are these abbreviations for Unicode and special URL characters?
I'm used to seeing \u00ff, etc. in JavaScript.
The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.
RFC 3986: Uniform Resource Identifiers specifies that reserved characters must be percent-encoded, but it also specifies that percent-encoded characters are decoded to US-ASCII, which does not include characters such as è.
RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:
>>> u'è'.encode('utf-8')
b'\xc3\xa8'
Here I've asked Python to encode the Unicode character è to a string of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.
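To see the whole round trip in one place, here is a minimal sketch in Python 3. The helper name percent_encode_utf8 is made up for illustration; quote and unquote are the standard library's own functions from urllib.parse, which default to UTF-8.

```python
from urllib.parse import quote, unquote

def percent_encode_utf8(text):
    """Hypothetical helper: percent-encode every byte of text's UTF-8 encoding."""
    return "".join("%{:02X}".format(byte) for byte in text.encode("utf-8"))

# Encode the single character by hand, byte by byte.
encoded_char = percent_encode_utf8("è")

# Decode the path segment from the question back to readable text.
title = unquote("Gruy%C3%A8re_%28cheese%29")

# Let the standard library do the encoding of the readable form.
round_trip = quote("Gruyère_(cheese)")

print(encoded_char)  # %C3%A8
print(title)         # Gruyère_(cheese)
print(round_trip)    # Gruy%C3%A8re_%28cheese%29
```

The hand-rolled helper escapes every byte, including plain letters, so in practice quote is the better choice: it leaves unreserved characters such as letters, digits and the underscore alone.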
The parentheses that also appear in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.
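A quick way to check this for yourself, again as a Python 3 sketch (the variable names are only for illustration): for characters in the ASCII range, the ASCII and UTF-8 encodings are the same single byte, so the percent-escape is just that byte in hex.

```python
escapes = {}
for ch in "()":
    ascii_bytes = ch.encode("ascii")
    utf8_bytes = ch.encode("utf-8")
    # For ASCII characters both encodings yield the same single byte.
    same_bytes = ascii_bytes == utf8_bytes
    escapes[ch] = "%{:02X}".format(utf8_bytes[0])
    print(ch, same_bytes, escapes[ch])
```

This prints %28 for the opening parenthesis and %29 for the closing one, matching the URL in the question.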
So, no, there is no simple 16×16 table—such a table could never represent the richness of Unicode. But there is a method to the apparent madness.
A % in a URI is followed by two hexadecimal digits (0-9, A-F), and together they are the escaped form of the character with that hex code. Doing this means you can write a URI containing characters that might otherwise have a special meaning, either in the URI itself or in other languages.
Common examples are %20 for a space and %5B and %5D for [ and ], respectively.
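If you want to look up the escape for a given character yourself, something like the following Python 3 sketch works. The function names escape_char and unescape_code are made up here, and the sketch only handles single-byte (ASCII) characters on the decoding side.

```python
def escape_char(ch):
    """Hypothetical helper: percent-escape one character via its UTF-8 bytes."""
    return "".join("%{:02X}".format(byte) for byte in ch.encode("utf-8"))

def unescape_code(code):
    """Hypothetical helper: turn a single escape like '%5B' back into a character."""
    return chr(int(code[1:], 16))

# Space, both square brackets, and the backslash that sits between them in ASCII.
for ch in [" ", "[", "\\", "]"]:
    print(repr(ch), escape_char(ch))
```

Running it shows that the backslash takes 5C, which is why the two brackets end up at 5B and 5D rather than at consecutive codes.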