
Reading Unicode Files - Python3.2

I'm trying to read some files using Python 3.2; some of the files may contain Unicode while others do not.

When I try:

file = open(item_path + item, encoding="utf-8")
for line in file:
    print (repr(line))

I get the error:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 13-16: ordinal not in range(128)

I am following the documentation here: http://docs.python.org/release/3.0.1/howto/unicode.html

Why would Python be trying to encode to ascii at any point in this code?

Peter-W asked Jun 21 '26 03:06



2 Answers

The problem is that, in Python 3, repr(line) also returns a Unicode string. It does not convert characters above code point 127 into ASCII escape sequences.

Use ascii(line) instead if you want to see the escape sequences.

Actually, repr(line) is expected to return a string that, if placed in source code, would produce an object with the same value. In that sense the Python 3 behaviour is fine: source files do not need ASCII escape sequences to express a string containing non-ASCII characters, and it is quite natural to use UTF-8 or some other Unicode encoding these days. Python 2, by contrast, produced escape sequences for such characters.
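To make the difference concrete, here is a minimal sketch comparing the two built-ins. The sample string is made up for illustration and is not taken from the asker's files.

```python
# Hypothetical sample line containing two non-ASCII characters.
line = "caf\u00e9 \u2603\n"

# repr() keeps printable non-ASCII characters as they are in Python 3.
as_repr = repr(line)

# ascii() escapes every non-ASCII character, so the result is pure ASCII.
as_ascii = ascii(line)

# Printing the ascii() form is safe whatever the output encoding is.
print(as_ascii)
```

Because the result of ascii() contains only ASCII characters, printing it should not raise UnicodeEncodeError even on an ASCII-only terminal.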

pepr answered Jun 22 '26 16:06



What's your output encoding? If you remove the call to print(), does it start working?

I suspect you've got a non-UTF-8 locale, so Python is trying to encode repr(line) as ASCII as part of printing it.
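One way to check this suspicion is to look at sys.stdout.encoding, and to reproduce the failure with a stream that is forced to ASCII. This is only a sketch: the ASCII-only stream below is a stand-in for the asker's terminal, and the sample line is hypothetical.

```python
import io
import sys

# The encoding print() uses for the real standard output.
print(sys.stdout.encoding)

# Simulate an output stream whose encoding is ASCII, as under a non-UTF-8 locale.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

line = "caf\u00e9\n"  # hypothetical sample line

error = None
try:
    # repr(line) still contains the non-ASCII character, so encoding it fails.
    print(repr(line), file=ascii_stream)
    ascii_stream.flush()
except UnicodeEncodeError as exc:
    error = exc
```

If the first print shows something like an ASCII codec name on your system, that would be consistent with the error in the question.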

To resolve the issue, you must either encode the string yourself and print the resulting bytes, or set your output encoding to something that can handle your strings (UTF-8 being the obvious choice).
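Both options can be sketched roughly as follows. A BytesIO object stands in for the terminal's binary stream here (on a real terminal that would typically be sys.stdout.buffer), and the sample line is hypothetical.

```python
import io

line = "caf\u00e9\n"  # hypothetical sample line

# Option 1: encode explicitly and write the bytes to a binary stream.
binary_out = io.BytesIO()
binary_out.write(repr(line).encode("utf-8") + b"\n")

# Option 2: wrap a binary stream in a text layer that encodes as UTF-8,
# then print to it as usual. newline="\n" disables newline translation.
wrapped = io.BytesIO()
utf8_out = io.TextIOWrapper(wrapped, encoding="utf-8", newline="\n")
print(repr(line), file=utf8_out)
utf8_out.flush()

encoded_by_hand = binary_out.getvalue()
encoded_by_wrapper = wrapped.getvalue()
```

Another route, assuming you can control how Python is started, is to set the PYTHONIOENCODING environment variable to utf-8, which changes the encoding used for the standard streams without touching the code.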

Andrew Aylett answered Jun 22 '26 17:06



