
What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character. If I type this into my browser:

http://www.google.com/search?q=♥ 

and then copy and paste it, I see this URL:

http://www.google.com/search?q=%E2%99%A5 

which makes it seem like Firefox (or Safari) is doing this.

urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'

which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.

If I type the URL

http://www.google.com/search?q=… 

into my browser then copy and paste, I get

http://www.google.com/search?q=%E2%80%A6 

back, which seems to be the result of doing

urllib.quote_plus(x.encode("utf-8")) 

which makes sense since … can't be encoded with Latin-1.

But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.

Since this seems to be ambiguous:

In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.

What's the right thing to be doing with the special characters I need to deal with?

Josh Gibson asked May 26 '09 21:05




1 Answer

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
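
The rule quoted above can be sketched by hand in Python 3. This is only an illustration, not how any particular browser or library implements it: `percent_encode` and `UNRESERVED` are hypothetical names, and the unreserved set is the one RFC 3986 defines (letters, digits, `-`, `.`, `_`, `~`).

```python
# Unreserved characters per RFC 3986: ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = frozenset(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789-._~"
)


def percent_encode(text):
    """Hypothetical helper: convert text to UTF-8 bytes, then percent-encode
    every byte that is not an unreserved ASCII character."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in UNRESERVED:
            out.append(ch)  # unreserved characters pass through untranslated
        else:
            out.append("%{:02X}".format(byte))  # everything else becomes %XX
    return "".join(out)


heart = percent_encode("\u2665")     # the heart character from the question
ellipsis = percent_encode("\u2026")  # the triple dot character
print(heart, ellipsis)
```

Both characters from the question come out as the three-byte UTF-8 sequences the browser produced, which suggests the browser is encoding as UTF-8 in both cases.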

Because other URL-encoding conventions were accepted in the past, browsers appear to try several methods when decoding a URI, but if you're the one doing the encoding, you should use UTF-8.
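
As a rough sketch of that advice in Python 3 (the question uses the Python 2 `urllib.quote_plus`; I'm assuming the Python 3 equivalents in `urllib.parse` here), you can pass the encoding explicitly on both sides. The variable names are just for illustration.

```python
from urllib.parse import quote_plus, unquote_plus

# Encode: quote_plus converts str to bytes with UTF-8 unless told otherwise.
encoded_heart = quote_plus("\u2665")
encoded_ellipsis = quote_plus("\u2026", encoding="utf-8")

# Decode: the percent-encoded bytes carry no label saying which charset
# produced them, so the decoder has to be told (or has to guess).
decoded_utf8 = unquote_plus(encoded_ellipsis, encoding="utf-8")
decoded_latin1 = unquote_plus(encoded_ellipsis, encoding="latin-1")

# Latin-1 cannot represent the heart character at all, so the bytes seen
# in the browser's URL cannot have come from a Latin-1 encode.
try:
    "\u2665".encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(encoded_heart, encoded_ellipsis)
print(repr(decoded_utf8), repr(decoded_latin1), latin1_ok)
```

Decoding the same bytes as Latin-1 does not fail; it just yields three unrelated characters, which is why agreeing on UTF-8 at encoding time matters.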

John Biesnecker answered Oct 11 '22 03:10