
How do I decode unicode one line at a time in Python 2.7?

The correct way to load unicode text from a file in Python 2.7 is something like:

content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)

(Update: No it isn't. See the answers.)

However, if the file is very large, I might want to read, decode and process it one line at a time, so that the whole file is never loaded into memory at once. Something like:

for line in open('filename'):
    process(line.decode('encoding'))

Iterating over the open file handle in the for loop reads the file lazily, yielding one line of raw bytes at a time.

This doesn't work though, because if the file is UTF-32 encoded (little-endian here), for example, then the bytes in the file (in hex) look something like:

hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)

And the split into lines done by the for loop splits on the 0a byte of the \n character, resulting in (in hex):

lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000

So the first byte of the \n character is left at the end of the first line, and its remaining three bytes end up at the start of the second line (followed by whatever text is actually on that line). Calling decode on either of these lines understandably results in a UnicodeDecodeError.

UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
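
The failure can be reproduced without a file. This is a minimal sketch, assuming the explicit 'utf-32-le' codec so the byte layout matches the hex above on any machine (so there is no BOM, and the byte positions differ from the error message above); the sample text and variable names are only for illustration.

```python
import io

# Two lines encoded as little-endian UTF-32 (4 bytes per character, no BOM).
data = u"hello\nworld\n".encode("utf-32-le")

# Iterating a binary stream splits after every 0x0a byte,
# even when that byte is only the first quarter of a 4-byte character.
raw_lines = list(io.BytesIO(data))

errors = []
for raw in raw_lines:
    try:
        raw.decode("utf-32-le")
    except UnicodeDecodeError as exc:
        errors.append(exc)

# The first chunk is 5 whole characters plus one stray byte.
first_chunk_length = len(raw_lines[0])
```

Every chunk ends up misaligned with the 4-byte character boundaries, so none of them decodes cleanly.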

So, obviously enough, splitting a unicode byte stream on 0a bytes is not the correct way to split it into lines. Instead I should be splitting on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters - and this decoding of the full stream is exactly the operation I'm trying to avoid.
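
Decoding does not have to happen in one pass over the whole stream, though. As a rough sketch of the idea (not taken from the answers), the standard library's incremental decoders can be fed arbitrary chunks of bytes and keep any incomplete character buffered until the next chunk arrives. The helper name iter_decoded_lines and the chunk size are hypothetical, and lines are split on \n only.

```python
import codecs
import io


def iter_decoded_lines(stream, encoding, chunk_size=4096):
    """Yield unicode lines from a binary stream, decoding chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = u""
    while True:
        chunk = stream.read(chunk_size)
        # An empty chunk means end of input; final=True then makes the
        # decoder raise if a character was left incomplete.
        pending += decoder.decode(chunk, final=not chunk)
        pieces = pending.split(u"\n")
        # Text after the last newline is not a complete line yet.
        pending = pieces.pop()
        for piece in pieces:
            yield piece + u"\n"
        if not chunk:
            break
    if pending:
        yield pending


# A chunk size of 7 deliberately cuts through the 4-byte characters.
data = u"hello\nworld\nlast".encode("utf-32")
decoded = list(iter_decoded_lines(io.BytesIO(data), "utf-32", chunk_size=7))
```

Only one chunk plus the current partial line is held in memory at a time, which is the property the question is after.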

This can't be an uncommon requirement. What's the correct way to handle it?

Jonathan Hartley asked Aug 08 '12 14:08


2 Answers

How about trying something like:

import codecs

# codecs.open always opens the file in binary mode, so plain "r" is enough
for line in codecs.open("filename", "r", "utf32"):
    print line

I think this should work.

The codecs module should do the translation for you.
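
To check that claim, here is a small self-contained round trip. It is a sketch only: the temporary file and the sample text are made up for the test and are not part of the question.

```python
import codecs
import os
import tempfile

# Create a throwaway UTF-32 file to read back (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with codecs.open(path, "w", encoding="utf-32") as f:
    f.write(u"hello\nw\u00f6rld\n")

# The reader decodes the bytes first and splits into lines afterwards,
# so each item is a complete unicode line.
with codecs.open(path, "r", encoding="utf-32") as f:
    lines = [line for line in f]

os.remove(path)
```

Each line comes back as a unicode object with its newline intact, and the reader pulls the file in small blocks rather than loading it whole.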

Simon Callan answered Oct 19 '22 03:10


Try using the codecs module:

import codecs

# each line arrives already decoded to unicode
for line in codecs.open(filename, encoding='utf32'):
    do_something(line)
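
An alternative not mentioned in either answer is io.open, which is available from Python 2.6 onwards and takes an encoding argument in the same way. A minimal sketch, again with a made-up temporary file and sample text:

```python
import io
import os
import tempfile

# Create a throwaway UTF-32 file (hypothetical sample data).
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-32", newline="\n") as f:
    f.write(u"first\nsecond\n")

line_lengths = []
with io.open(path, "r", encoding="utf-32") as f:
    for line in f:  # decoded lazily, one buffered block at a time
        line_lengths.append(len(line))

os.remove(path)
```

The lengths are counted in characters, not bytes, which shows the lines were decoded before being split.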

Andreas Jung answered Oct 19 '22 01:10