I am having some trouble understanding the correct way to handle Unicode strings in Python. I have read many questions about it, but it is still unclear what I should do to avoid problems when reading and writing files.
My goal is to read some huge (up to 7GB) files efficiently, line by line. I was doing it with a simple with open(filename) as f:, but I ended up with an ASCII decoding error.
Then I read that the correct way of doing it would be to write:
with codecs.open(filename, 'r', encoding='utf-8') as logfile:
However, this ends up with:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x88 in position 13: invalid start byte
Frankly I haven't understood why this exception is raised.
I have found a working solution doing:
with open(filename) as f:
    for line in f:
        line = unicode(line, errors='ignore')
But this approach ended up being incredibly slow. Therefore, my question is:
Is there a correct way of doing this, and what is the fastest way? Thanks
Your data is probably not UTF-8 encoded. Figure out the correct encoding and use that instead. We can't tell you what codec is right, because we can't see your data.
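One way to narrow down the real encoding is to read a chunk of the file in binary mode and try a few candidate codecs against it. This is only a sketch (the candidate list is an example, and latin-1 will accept any byte sequence, so it acts as a catch-all rather than proof of the encoding):

```python
def guess_encoding(filename, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate codec that can decode a sample of the file.

    Note: latin-1 maps every byte to a character, so it never fails;
    treat it as a fallback, not as confirmation.
    """
    with open(filename, 'rb') as f:
        sample = f.read(1 << 20)  # inspect the first 1 MiB only
    for enc in candidates:
        try:
            sample.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

For example, the byte 0x88 from your traceback is an invalid start byte in UTF-8, but it is a perfectly valid character in cp1252, which is a common culprit for logs produced on Windows.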
If you must specify an error handler, you may as well do so when opening the file. Use the io.open() function; codecs is an older library with some known issues, while io (which underpins all I/O in Python 3 and was backported to Python 2) is far more robust and versatile.
The io.open() function takes an errors argument too:
import io
with io.open(filename, 'r', encoding='utf-8', errors='replace') as logfile:
I picked replace as the error handler so that you at least get placeholder characters for anything that could not be decoded.
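As a quick self-contained illustration of the replace behavior (the file here is a made-up example containing the invalid byte 0x88 from your traceback):

```python
import io
import os
import tempfile

# Write a small file with one byte that is invalid in UTF-8.
fd, path = tempfile.mkstemp()
os.write(fd, b'good line\nbad \x88 byte\n')
os.close(fd)

# errors='replace' swaps each undecodable byte for U+FFFD
# (the replacement character) instead of raising UnicodeDecodeError.
with io.open(path, 'r', encoding='utf-8', errors='replace') as logfile:
    lines = logfile.read().splitlines()

assert u'\ufffd' in lines[1]
os.remove(path)
```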