I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.
My goal is to read some huge (up to 7GB) files efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an ASCII decoding error.
Then I read that the correct way of doing it would be to write:
with codecs.open(filename, 'r', encoding='utf-8') as logfile:
However, this ends up with:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte
Frankly I haven't understood why this exception is raised.
I have found a working solution doing:
with open(filename) as f:
    for line in f:
        line = unicode(line, errors='ignore')
But this approach ended up being incredibly slow. Therefore, my question is:
Is there a correct way of doing this, and what is the fastest way? Thanks
Your data is probably not UTF-8 encoded. Figure out the correct encoding and use that instead. We can't tell you what codec is right, because we can't see your data.
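One way to narrow down the real encoding is to read a chunk of the file in binary mode and try a few candidate codecs against it. This is only a sketch (the candidate list is an example, and latin-1 will accept any byte sequence, so it acts as a catch-all rather than proof of the encoding):

```python
def guess_encoding(filename, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that can decode a sample of the file.

    Note: latin-1 maps every byte to a character, so it never fails;
    treat it as a fallback, not as confirmation.
    """
    with open(filename, 'rb') as f:
        sample = f.read(1 << 20)  # inspect the first 1 MiB only
    for enc in candidates:
        try:
            sample.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

For example, the byte 0x88 from your traceback is an invalid start byte in UTF-8, but it is a perfectly valid character in cp1252, which is a common culprit for logs produced on Windows.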
If you must specify an error handler, you may as well do so when opening the file. Use the io.open() function; codecs is an older library with some known issues, while io (which underpins all I/O in Python 3 and was backported to Python 2) is far more robust and versatile.
The io.open() function takes an errors argument too:
import io
with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
I picked replace as the error handler so that you at least get placeholder characters for anything that could not be decoded.
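As a quick self-contained illustration of the replace behavior (the file here is a made-up example containing the invalid byte 0x88 from your traceback):

```python
import io
import os
import tempfile

# Write a small file with one byte that is invalid in UTF-8.
fd, path = tempfile.mkstemp()
os.write(fd, b'good line\nbad \x88 byte\n')
os.close(fd)

# errors='replace' swaps each undecodable byte for U+FFFD
# (the replacement character) instead of raising UnicodeDecodeError.
with io.open(path, 'r', encoding='utf-8', errors='replace') as logfile:
    lines = logfile.read().splitlines()

assert u'\ufffd' in lines[1]
os.remove(path)
```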