
write()-ing an encoded string in Python 3.x

I've got a unicode string (s) which I want to write into a file.

In Python 2 I could write:

open('filename', 'w').write(s.encode('utf-8'))

But this fails in Python 3. Apparently, s.encode() returns an object of type 'bytes', which the write() method of a file opened in text mode does not accept:

TypeError: must be str, not bytes

Does anyone know how to port the above code to Python 3?

Edit:

Thanks to all of you who proposed using binary mode! Unfortunately, this causes a problem with the \n characters. Is there any way to achieve the same result I had with Python 2 (namely to encode non-ANSI characters in UTF-8 while keeping the OS-specific rendition of \n)?

Thanks!

Tom asked Sep 10 '11 16:09



1 Answer

You do not want to muck around with manually encoding each and every piece of data like that! Simply pass the encoding as an argument to open, like this:

#!/usr/bin/env python3.2

slist = [
    "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City",
    "na\N{LATIN SMALL LETTER I WITH DIAERESIS}vet\N{LATIN SMALL LETTER E WITH ACUTE}",
    "fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade",
    "\N{GREEK SMALL LETTER BETA}-globulin"
]

with open("/tmp/sample.utf8", mode="w", encoding="utf8") as f:
    for s in slist:
        print(s, file=f)

Now if you look at the file you made, you’ll see that it says:

$ cat /tmp/sample.utf8
Cañon City
naïveté
façade
β-globulin

And you can see that those are the right code points this way:

$ uniquote -x /tmp/sample.utf8
Ca\x{F1}on City
na\x{EF}vet\x{E9}
fa\x{E7}ade
\x{3B2}-globulin
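
If uniquote is not installed, roughly the same check can be sketched in Python itself. This is only a minimal sketch: it writes to a temporary scratch file rather than /tmp/sample.utf8, and the helper name escape_non_ascii is hypothetical, made up here for illustration.

```python
import os
import tempfile


def escape_non_ascii(text):
    """Render each non-ASCII character as \\x{HEX}, leaving ASCII untouched."""
    return "".join(ch if ord(ch) < 128 else "\\x{%X}" % ord(ch) for ch in text)


slist = [
    "Ca\N{LATIN SMALL LETTER N WITH TILDE}on City",
    "\N{GREEK SMALL LETTER BETA}-globulin",
]

# Hypothetical scratch file standing in for /tmp/sample.utf8
path = os.path.join(tempfile.mkdtemp(), "sample.utf8")

with open(path, mode="w", encoding="utf8") as f:
    for s in slist:
        print(s, file=f)

# Reading with the same encoding argument decodes the bytes back to str
with open(path, encoding="utf8") as f:
    escaped = [escape_non_ascii(line.rstrip("\n")) for line in f]

for line in escaped:
    print(line)
```

The escaped lines should match the uniquote output for the same strings, since both show the code point of each non-ASCII character in hexadecimal.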

See how much easier that is? Let the stream object handle any low-level encoding or decoding for you.

Summary: Don't call encode or decode yourself when all you are doing is processing a homogeneous stream that is entirely in one encoding. That's way too much bother for zero gain. Pass the encoding argument once, when you open the file, and be done with it.
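
As for the follow-up about \n: a file opened in text mode with an encoding argument still goes through Python's newline translation, so the UTF-8 encoding and the line-ending conversion happen together. A minimal sketch, assuming a temporary scratch file; newline="\r\n" is passed explicitly here only so the effect is visible on any platform, whereas the default (newline=None) translates "\n" to os.linesep.

```python
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), "newline_demo.txt")

# Text mode: str goes in, UTF-8 bytes come out, and "\n" is translated on write.
# newline="\r\n" forces the Windows convention so the result is the same everywhere;
# leaving newline at its default would translate "\n" to os.linesep instead.
with open(path, mode="w", encoding="utf8", newline="\r\n") as f:
    f.write("fa\N{LATIN SMALL LETTER C WITH CEDILLA}ade\n")

# Binary mode shows the bytes that actually reached the disk
with open(path, mode="rb") as f:
    raw = f.read()

print(raw)
```

In everyday code you would normally omit the newline argument and let the platform default apply, which is the behaviour the question asks for.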

tchrist answered Oct 22 '22 14:10