The correct way to load unicode text in Python 2.7 is something like:
content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)
(Update: No it isn't. See the answers.)
However, if the file is very large, I might want to read, decode and process it one line at a time, so that the whole file is never loaded into memory at once. Something like:
for line in open('filename'):
    process(line.decode('encoding'))
The for loop's iteration over the open filehandle is a generator that reads one line at a time.
This doesn't work though, because if the file is utf32 encoded, for example, then the bytes in the file (in hex) look something like:
hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)
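For example, encoding the text by hand reproduces this layout (a minimal sketch, assuming the little-endian utf-32-le variant so that no byte-order mark is added):
raw = u'hello\n'.encode('utf-32-le')
print repr(raw)
# 'h\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o\x00\x00\x00\n\x00\x00\x00'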
And the split into lines done by the for loop splits on the 0a byte of the \n character, resulting in (in hex):
lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000
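This truncation can be reproduced directly on the raw bytes (a sketch, again assuming the little-endian variant; splitlines(True) mimics how iterating over the file splits while keeping the line endings):
raw = u'hello\nworld\n'.encode('utf-32-le')
lines = raw.splitlines(True)
# lines[0] ends with the lone 0x0a byte of the first newline;
# lines[1] begins with the three 0x00 bytes that belonged to it.
print repr(lines[0])
print repr(lines[1])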
So part of the \n character is left at the end of line 1, and the remaining three bytes end up in line 2 (followed by whatever text is actually in line 2). Calling decode on either of these lines understandably results in a UnicodeDecodeError.
UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
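Continuing that sketch, decoding the truncated first piece raises the error:
first = u'hello\n'.encode('utf-32-le').splitlines(True)[0]
try:
    first.decode('utf-32-le')
except UnicodeDecodeError as exc:
    print exc  # complains about truncated data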
So, obviously enough, splitting a unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead I should be splitting on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters - and this decoding of the full stream is exactly the operation I'm trying to avoid.
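For concreteness, that four-byte newline is just u'\n' encoded; splitting the raw bytes on it avoids the mid-character split, but as written it still reads the whole file at once (a rough sketch, reusing the process() placeholder from above and assuming little-endian data with no byte-order mark):
newline = u'\n'.encode('utf-32-le')  # '\n\x00\x00\x00'
raw = open('filename', 'rb').read()
for chunk in raw.split(newline):
    if chunk:  # skip the empty piece produced by a trailing newline
        process(chunk.decode('utf-32-le'))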
This can't be an uncommon requirement. What's the correct way to handle it?
In Python, the built-in functions chr() and ord() convert between Unicode code points and characters (in Python 2, unichr() is the Unicode-aware counterpart of chr()). A character can also be written as a hexadecimal Unicode code point with a \x, \u, or \U escape in a string literal.
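For instance, relating these built-ins to the newline character discussed above (a small Python 2 sketch):
print ord(u'\n')             # 10, i.e. 0x0a
print unichr(0x0a) == u'\n'  # True
print u'\u000a' == u'\n'     # True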
codecs.decode(obj, encoding='utf-8', errors='strict') decodes obj using the codec registered for encoding. errors may be given to set the desired error-handling scheme. The default error handler is 'strict', meaning that decoding errors raise ValueError (or a more codec-specific subclass, such as UnicodeDecodeError).
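For example (a sketch; the encoding is passed explicitly rather than relying on the default):
import codecs
print repr(codecs.decode('h\x00\x00\x00i\x00\x00\x00', 'utf-32-le'))  # u'hi'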
Python's decode() method on byte strings converts bytes to a text (unicode) string. Both encode() and decode() let you specify the error-handling scheme to use for encoding/decoding errors; the default is 'strict', meaning that encoding errors raise a UnicodeEncodeError and decoding errors raise a UnicodeDecodeError.
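For example, a non-strict handler gets something back from truncated bytes like those above, at the cost of a replacement character (a sketch):
truncated = 'h\x00\x00\x00\x0a'  # one complete character plus a lone 0x0a byte
print repr(truncated.decode('utf-32-le', 'replace'))  # u'h\ufffd'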
Python 2 uses the str type to store bytes and the unicode type to store Unicode code points. By default, string literals are str, which is bytes, and the default encoding is ASCII. So if an incoming file contains Cyrillic characters, Python 2 may fail, because ASCII cannot handle those Cyrillic characters.
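A quick illustration (the two bytes below are the UTF-8 encoding of the Cyrillic letter 'П'):
data = '\xd0\x9f'            # str: raw bytes
text = data.decode('utf-8')  # unicode: u'\u041f'
print repr(text)
# unicode(data) with no explicit encoding would fall back to ASCII and raise
# UnicodeDecodeError, because 0xd0 is not a valid ASCII byte.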
How about trying something like:
import codecs
for line in codecs.open("filename", "rt", "utf32"):
    print line
I think this should work. The codecs module should do the translation for you.
Try using the codecs module:
import codecs
for line in codecs.open(filename, encoding='utf32'):
    do_something(line)
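As a side note, not from the original answers: Python 2.6+ also ships the io module, and io.open() likewise decodes incrementally and yields unicode lines, so it can be used the same way (a sketch):
import io
with io.open('filename', encoding='utf32') as f:
    for line in f:
        do_something(line)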