What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.

Some interesting examples:

The heart character. If I type this into my browser:

http://www.google.com/search?q=♥

and then copy and paste it, I see this URL:

http://www.google.com/search?q=%E2%99%A5

which makes it seem like Firefox (or Safari) is doing this:

urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'

which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.

…

If I type the URL

http://www.google.com/search?q=…

into my browser then copy and paste, I get

http://www.google.com/search?q=%E2%80%A6

back, which seems to be the result of doing

urllib.quote_plus(x.encode("utf-8"))

which makes sense since … can't be encoded with Latin-1.

But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1, since this seems to be ambiguous:

In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xe2\x80\xa6'

works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.

What's the right thing to be doing with the special characters I need to deal with?

asked May 26 '09 21:05

Josh Gibson

1 Answer

I would always encode in UTF-8. From the Wikipedia page on percent encoding:

The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
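As a rough illustration of the rule quoted above, here is a minimal Python 3 sketch. `percent_encode_utf8` is a hypothetical helper name, not something from the question or from any library. It leaves the RFC 3986 unreserved characters alone and percent-encodes the UTF-8 bytes of everything else. In real code you would most likely just call `urllib.parse.quote` (the Python 3 counterpart of the `urllib.quote` family used in the question), which encodes with UTF-8 by default.

```python
import string

# Unreserved characters per RFC 3986, section 2.3:
# ALPHA / DIGIT / "-" / "." / "_" / "~"
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def percent_encode_utf8(text):
    """Hypothetical helper: percent-encode `text` using UTF-8 bytes."""
    pieces = []
    for ch in text:
        if ch in UNRESERVED:
            # Unreserved characters are passed through without translation.
            pieces.append(ch)
        else:
            # Everything else becomes its UTF-8 bytes, each written as %XX.
            pieces.extend("%{:02X}".format(byte) for byte in ch.encode("utf-8"))
    return "".join(pieces)


if __name__ == "__main__":
    print(percent_encode_utf8("\u2665"))  # heart -> %E2%99%A5
    print(percent_encode_utf8("\u2026"))  # horizontal ellipsis -> %E2%80%A6
```

Both results match the URLs the browser produced in the question, so both of those URLs are UTF-8. Latin-1 cannot represent either character at all.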

It seems that because other ways of doing URL encoding were accepted in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
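On the decoding side, one plausible strategy is sketched below in Python 3. This is only an assumption about how a tolerant decoder could behave, not a description of what any particular browser does, and `decode_percent_encoded` is a hypothetical name. It tries strict UTF-8 first and falls back to Latin-1 only when the bytes are not valid UTF-8.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_encoded(value):
    """Hypothetical helper: decode %XX escapes, preferring UTF-8."""
    # Turn the %XX escapes back into raw bytes without guessing an encoding.
    raw = unquote_to_bytes(value)
    try:
        # Strict UTF-8 rejects most byte sequences that were really Latin-1.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails.
        return raw.decode("latin-1")


if __name__ == "__main__":
    print(decode_percent_encoded("%E2%99%A5"))  # valid UTF-8: the heart
    print(decode_percent_encoded("%E9"))        # not valid UTF-8: falls back to Latin-1
```

The order matters. Latin-1 decoding succeeds on any byte sequence, which is why the experiment in the question "works" without telling you anything, whereas UTF-8 decoding fails on most input that was not UTF-8 to begin with. Note that this sketch does not treat `+` as a space, which form-encoded query strings would require.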

answered Oct 11 '22 03:10

John Biesnecker


Tags: urlencode, character-encoding, unicode, utf-8, web-standards
