How do I decode unicode one line at a time in Python 2.7?

Tags: python, generator, file-io, unicode, python-2.7

The correct way to load unicode text from Python 2.7 is something like:

content = open('filename').read().decode('encoding')
for line in content.splitlines():
    process(line)

(Update: No it isn't. See the answers.)

However, if the file is very large, I might want to read, decode and process it one line at a time, so that the whole file is never loaded into memory at once. Something like:

for line in open('filename'):
    process(line.decode('encoding'))

Iterating over the open file object in the for loop reads one line at a time.

This doesn't work, though: if the file is UTF-32 encoded, for example, the bytes in the file (in hex) look something like:

hello\n = 68000000(h) 65000000(e) 6c000000(l) 6c000000(l) 6f000000(o) 0a000000(\n)
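The byte layout above can be reproduced with a short sketch. It assumes the little-endian variant of the codec (`utf-32-le`), which matches the byte order shown; the plain `utf-32` codec would also prepend a 4-byte byte order mark. The variable names are illustrative only.

```python
import binascii

# Encode without a byte order mark, little-endian, to match the dump above.
encoded = u"hello\n".encode("utf-32-le")

# hexlify returns bytes on Python 3, so decode to get a text string on both 2 and 3.
hex_digits = binascii.hexlify(encoded).decode("ascii")

# Group the hex digits four bytes (eight digits) at a time, one group per character.
groups = [hex_digits[i:i + 8] for i in range(0, len(hex_digits), 8)]
```

Each entry of `groups` is one four-byte code unit, so the newline occupies a whole group of its own rather than a single byte.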

The for loop splits lines at the 0a byte of the \n character, which gives (in hex):

lines[0] = 0x 68000000 65000000 6c000000 6c000000 6f000000 0a
lines[1] = 0x 000000

So the first byte of the \n character is left at the end of the first line, and its remaining three bytes end up at the start of the second line (followed by whatever text is actually on that line). Calling decode on either of these lines understandably results in a UnicodeDecodeError.

UnicodeDecodeError: 'utf32' codec can't decode byte 0x0a in position 24: truncated data
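A minimal sketch that reproduces the failure without needing a file on disk. It assumes little-endian UTF-32 with no byte order mark, and uses `io.BytesIO` as a stand-in for a file opened in binary mode; `can_decode` is a hypothetical helper written for this illustration.

```python
import io

# Two lines of text, encoded as little-endian UTF-32 without a BOM.
data = u"hello\nworld\n".encode("utf-32-le")

# Iterating a binary stream ends each chunk right after a 0x0a byte,
# the same way iterating over a file opened in binary mode does.
chunks = list(io.BytesIO(data))

def can_decode(chunk, encoding="utf-32-le"):
    """Return True if the chunk decodes cleanly, False on UnicodeDecodeError."""
    try:
        chunk.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

results = [can_decode(chunk) for chunk in chunks]
```

The first chunk is 21 bytes long, which is not a multiple of four, and every later chunk starts with the three leftover zero bytes of the previous newline, so none of the chunks decodes on its own.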

So, clearly, splitting an encoded byte stream on 0a bytes is not the correct way to split it into lines. Instead I should be splitting on occurrences of the full four-byte newline character (0x0a000000). However, I think the correct way to detect these characters is to decode the byte stream into a unicode string and look for \n characters, and decoding the full stream is exactly the operation I'm trying to avoid.

This can't be an uncommon requirement. What's the correct way to handle it?


asked Aug 08 '12 14:08

Jonathan Hartley

2 Answers

How about trying something like:

import codecs

for line in codecs.open("filename", "rt", "utf32"):
    print line

I think this should work.

The codecs module should do the translation for you.
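To try this end to end, here is a small self-contained sketch that writes a throwaway UTF-32 file and reads it back line by line. The temporary file and its contents are made up for illustration, and the code avoids the print statement so that it runs on both Python 2.7 and Python 3.

```python
import codecs
import os
import tempfile

# Create a throwaway sample file (hypothetical content, for illustration only).
fd, path = tempfile.mkstemp()
os.close(fd)
with codecs.open(path, "w", encoding="utf-32") as f:
    f.write(u"hello\nworld\n")

# The reader returned by codecs.open decodes as it reads,
# so every line it yields is already a unicode string.
with codecs.open(path, "r", encoding="utf-32") as f:
    lines = [line for line in f]

os.remove(path)
```

Here the lines are collected into a list only so the result can be inspected; for a large file you would process each line inside the loop instead.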


answered Oct 19 '22 03:10

Simon Callan

Try using the codecs module:

import codecs

for line in codecs.open(filename, encoding='utf32'):
    do_something(line)
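For intuition about why this does not require decoding the whole file first, here is a rough sketch of line-at-a-time decoding built on `codecs.getincrementaldecoder`. It is not necessarily how `codecs.open` is implemented internally; `iter_decoded_lines` and its `chunk_size` parameter are hypothetical names, and the sketch splits on `\n` only.

```python
import codecs
import io

def iter_decoded_lines(stream, encoding, chunk_size=4096):
    """Yield unicode lines from a binary stream, decoding chunk by chunk."""
    decoder = codecs.getincrementaldecoder(encoding)()
    pending = u""
    while True:
        chunk = stream.read(chunk_size)
        # The decoder keeps incomplete characters buffered between calls;
        # final=True on the last (empty) read reports truncated input.
        pending += decoder.decode(chunk, final=not chunk)
        parts = pending.split(u"\n")
        pending = parts.pop()  # text after the last newline seen so far
        for part in parts:
            yield part + u"\n"
        if not chunk:
            break
    if pending:
        yield pending  # last line without a trailing newline

# Usage with an in-memory stream and a deliberately awkward chunk size.
sample = io.BytesIO(u"hello\nworld\nlast".encode("utf-32"))
decoded = list(iter_decoded_lines(sample, "utf-32", chunk_size=5))
```

Only one chunk plus the current partial line is held in memory at a time, and the newline is detected after decoding, so the four-byte newline is never cut in half.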

answered Oct 19 '22 01:10

Andreas Jung
