Unicode (UTF-8) reading and writing to files in Python

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type Capit\xc3\xa1n into my favorite editor and save it as file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here is what the point of the UTF-8 representation is if you can't actually get Python to recognize it when it comes from outside. Maybe I should just JSON dump the string and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode when it comes in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

asked Jan 29 '09 15:01

Gregg Lind

1 Answer

Rather than mess with the encode and decode methods, I find it easier to specify the encoding when opening the file. The io module (added in Python 2.6) provides an io.open function, which has an encoding parameter.

Use the open function from the io module.

>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")

Then, after calling f's read() method, a decoded Unicode object is returned.

>>> f.read()
u'Capit\xe1l\n\n'

Note that in Python 3, the io.open function is an alias for the built-in open function. The built-in open function only supports the encoding argument in Python 3, not Python 2.
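For what it's worth, here is a minimal round-trip sketch of the io.open approach, assuming Python 3 (it should also run on Python 2.6+). The scratch path and the variable names are made up for illustration; the string is the one from the question, without the trailing newline.

```python
import io
import os
import tempfile

text = u'Capit\xe1n'  # the string from the question, with an a-acute

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'f1')

# Text mode with an explicit encoding: text is encoded to UTF-8 on write.
with io.open(path, mode='w', encoding='utf-8') as f:
    f.write(text)

# Text mode again: the bytes on disk are decoded back to text on read.
with io.open(path, mode='r', encoding='utf-8') as f:
    decoded = f.read()

# Binary mode shows the raw UTF-8 bytes that were actually stored.
with io.open(path, mode='rb') as f:
    raw = f.read()

assert decoded == text                 # same text comes back
assert raw == b'Capit\xc3\xa1n'        # a-acute is stored as bytes C3 A1
```

The point of the sketch is that no explicit encode or decode call appears anywhere; the file object does both, given the encoding argument.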

Edit: Previously this answer recommended the codecs module. The codecs module can cause problems when mixing read() and readline(), so this answer now recommends the io module instead.

The original recommendation, kept for reference: use the open function from the codecs module.

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")

Then, after calling f's read() method, a decoded Unicode object is returned.

>>> f.read()
u'Capit\xe1l\n\n'

If you know the encoding of a file, using the codecs module is going to be much less confusing than encoding and decoding by hand.

See http://docs.python.org/library/codecs.html#codecs.open
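On the f2 file from the question: as far as I can tell, typing Capit\xc3\xa1n into an editor stores a literal backslash followed by x, c, 3 and so on, which is plain ASCII rather than the two UTF-8 bytes for the a-acute. Decoding it as UTF-8 therefore succeeds but leaves the backslashes in place. The sketch below illustrates that, assuming Python 3 (it should also run on Python 2.6+); f1_bytes, f2_bytes and the other names are hypothetical stand-ins for the file contents, and the recovery step at the end is just one possible workaround, not the only one.

```python
# Stand-in for f1: bytes written by Python, a-acute is the two bytes C3 A1.
f1_bytes = b'Capit\xc3\xa1n'

# Stand-in for f2: what an editor saves when the escape is typed literally.
f2_bytes = b'Capit\\xc3\\xa1n'

f1_len = len(f1_bytes)  # 8 bytes
f2_len = len(f2_bytes)  # 14 bytes, each backslash escape is four ASCII chars

# f1 decodes to the intended text.
assert f1_bytes.decode('utf-8') == u'Capit\xe1n'

# f2 is valid UTF-8 (it is pure ASCII), so the backslashes survive decoding.
assert f2_bytes.decode('utf-8') == u'Capit\\xc3\\xa1n'

# An ASCII-only form that Python can turn back into text: unicode_escape.
ascii_form = u'Capit\xe1n'.encode('unicode_escape')
assert ascii_form == b'Capit\\xe1n'
assert ascii_form.decode('unicode_escape') == u'Capit\xe1n'

# One possible way to rescue f2: interpret the escapes, map each resulting
# code point back to a single byte via latin-1, then decode as UTF-8.
recovered = (f2_bytes.decode('unicode_escape')
                     .encode('latin-1')
                     .decode('utf-8'))
assert recovered == u'Capit\xe1n'
```

In practice the simpler route is to type the actual character into the editor, save the file as UTF-8, and read it with an explicit encoding.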

answered Oct 17 '22 20:10

Tim Swast

Tags: python, io, unicode, utf-8
