
Python: what does "...".encode("utf8") fix?

I wanted to URL-encode a Python string and got exceptions with Hebrew strings. I couldn't fix it and started doing some guess-oriented programming. Finally, doing mystr = mystr.encode("utf8") before sending it to the URL encoder saved the day.

Can somebody explain what happened? What does .encode("utf8") do? My original string was a unicode string anyways (i.e. prefixed by a u).

flybywire asked Jul 20 '10 14:07




1 Answer

My original string was a unicode string anyways (i.e. prefixed by a u)

...which is the problem. It wasn't a "string", as such, but a "Unicode object". It contains a sequence of Unicode code points. These code points must, of course, have some internal representation that Python knows about, but whatever that is is abstracted away and they're shown as those \uXXXX entities when you print repr(my_u_str).

To get a sequence of bytes that another program can understand, you need to take that sequence of Unicode code points and encode it. You need to decide on the encoding, because there are plenty to choose from. UTF-8 and UTF-16 are common choices. ASCII could be too, if every character in the string fits in it. u"abc".encode('ascii') works just fine.
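A minimal sketch of that choice, assuming a Python 3 interpreter (the u"" prefix is still accepted there). The variable names are made up for illustration; only .encode() and the sample string come from the answer.

```python
# The same code points, encoded three different ways.
text = u"\u2119ython"  # sample string used in this answer

utf8_bytes = text.encode("utf-8")       # 3 bytes for U+2119, 1 byte per ASCII letter
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character here, no BOM

# ASCII is fine as long as every code point is below 128 ...
ascii_bytes = u"abc".encode("ascii")

# ... and raises UnicodeEncodeError as soon as one is not.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

print(repr(utf8_bytes), len(utf8_bytes), len(utf16_bytes), ascii_failed)
```

The point is that the code points stay the same; only the byte sequence changes with the encoding you pick.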

Do my_u_str = u"\u2119ython" and then type(my_u_str) and type(my_u_str.encode('utf8')) to see the difference in types: The first is <type 'unicode'> and the second is <type 'str'>. (Under Python 2.5 and 2.6, anyway).
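As a rough sketch, the same experiment can be written so it runs on either major version. The names type_names and round_trip are hypothetical, and the type names in the comment for Python 3 are what I would expect there rather than something stated above.

```python
my_u_str = u"\u2119ython"
encoded = my_u_str.encode("utf8")

# Python 2: ('unicode', 'str')   Python 3: ('str', 'bytes')
type_names = (type(my_u_str).__name__, type(encoded).__name__)
print(type_names)

# Decoding with the same encoding gives back the original code points.
round_trip = encoded.decode("utf8")
print(round_trip == my_u_str)
```

Either way there are two distinct types: one holding code points, one holding encoded bytes.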

Things are different in Python 3, but since I rarely use it I'd be talking out of my hat if I tried to say anything authoritative about it.
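For what it's worth, here is a hedged sketch of the asker's URL-encoding case under Python 3, assuming urllib.parse.quote stands in for whatever URL encoder was actually used (the question doesn't say). The Hebrew sample word is my own choice, written with escapes so the file encoding doesn't matter.

```python
from urllib.parse import quote  # Python 3 location of the percent-encoder

# "shalom" in Hebrew, as four Unicode code points.
hebrew = "\u05e9\u05dc\u05d5\u05dd"

# Explicit route, as in the question: encode to UTF-8 bytes first.
as_bytes = hebrew.encode("utf8")
quoted_from_bytes = quote(as_bytes)

# In Python 3, quote() also accepts text and encodes it as UTF-8 by default.
quoted_from_text = quote(hebrew)

print(quoted_from_bytes)
print(quoted_from_bytes == quoted_from_text)
```

Percent-encoding works on bytes, so something has to turn the code points into bytes first; calling .encode("utf8") just does that step explicitly.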

detly answered Oct 17 '22 13:10
