 

Unicode URL decoding

The usual method of URL-encoding a Unicode character is to split it into two %HH codes. (\u4161 => %41%61)

But how is Unicode distinguished when decoding? How do you know whether %41%61 is \u4161 or \x41\x61 ("Aa")?

Are 8-bit characters that require encoding preceded by %00?

Or is the point that Unicode characters are supposed to be lost/split?

Jonathan Lonowski asked Oct 01 '08 01:10



1 Answer

According to Wikipedia:

Current standard

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
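To make the %uxxxx form concrete, here is a rough Python sketch of a decoder for escape()-style input. The helper name legacy_unescape is made up for illustration. It assumes the input came from ECMAScript's escape(), where %uXXXX is a UTF-16 code unit and a plain %HH is a code unit below 256 (not a UTF-8 byte), so it should not be used on standard UTF-8 percent-encoded URLs.

```python
import re

# Matches either the non-standard %uXXXX form or a plain %HH escape.
_ESCAPE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def legacy_unescape(text: str) -> str:
    """Hypothetical helper: decode escape()-style %uXXXX and %HH sequences.

    Assumption: every escape is one UTF-16 code unit, as produced by
    ECMAScript's escape(). This is NOT RFC 3986 decoding.
    """
    def to_code_unit(match: re.Match) -> str:
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    units = _ESCAPE.sub(to_code_unit, text)
    # Round-trip through UTF-16 so that surrogate pairs such as
    # %uD83D%uDE00 are joined into a single code point.
    raw = units.encode("utf-16-le", "surrogatepass")
    return raw.decode("utf-16-le", "surrogatepass")


if __name__ == "__main__":
    print(legacy_unescape("%u4161") == "\u4161")  # one character, U+4161
    print(legacy_unescape("%41%61"))              # two characters: Aa
```

Note that under this scheme %u4161 and %41%61 are unambiguous: the "u" marks the four-digit form, which is exactly what the plain %HH%HH split in the question lacks.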

So, it looks like it's entirely up to the person writing the decode method... Aren't standards fun?
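For the RFC 3986 behavior quoted above (convert to UTF-8, then percent-encode each byte), a small Python sketch with the standard library's urllib.parse shows why the ambiguity in the question does not arise in practice. It assumes both sides agree on UTF-8, which is the default for quote and unquote; variable names are only illustrative.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

ch = "\u4161"

# U+4161 becomes three UTF-8 bytes (E4 85 A1), each percent-encoded.
encoded = quote(ch)
print(encoded)  # %E4%85%A1

# %41%61 is just the two bytes 0x41 0x61, i.e. the ASCII text "Aa".
two_ascii = unquote("%41%61")
print(two_ascii)  # Aa

# Decoding the UTF-8 form gives the original character back.
round_trip = unquote(encoded)

# Percent-decoding itself only yields bytes; the charset is a separate step.
raw = unquote_to_bytes(encoded)
```

So %41%61 never stands for \u4161 under this scheme. The decoder recovers bytes first, and what those bytes mean depends on the agreed character encoding.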

FlySwat answered Dec 22 '22 06:12
