I wrote the simplest Python program that exhibits the error I need help with.
lines_read = 0
urllist_file = open('../fall11_urls.txt', 'r')
for line in urllist_file:
    lines_read += 1
print('line count:', lines_read)
I run this on most files and of course it works as expected, but "fall11_urls.txt" is a 14-million-line text file that contains URLs, one per line. Some of these lines contain bytes that are apparently not valid UTF-8, and I get the error quoted below. I need access to every one of these URLs. What is the best way to handle this? These URLs can be "anything"; some are 400 characters of random characters, as in "https://bbswigr.fty.com/_Kcsnuk4J71A/RjzGhXZGmfI/AAAARg/xP3FO-Xbt68/s320/Axolo.jpg". Some of these strings contain bytes such as 0x96. I need my Python program to be robust against whatever might be in the file. (If it matters, this runs on Ubuntu 16.04.)
Here is the error:
Traceback (most recent call last):
  File "./count_lines.py", line 2, in <module>
    for line in urllist_file:
  File "/home/chris/.virtualenvs/cvml3/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 5529: invalid start byte
One more bit of information: iconv finds the same problem with the same file. See below:
$ iconv ../fall11_urls.txt >> /dev/null
iconv: illegal input sequence at position 1042953625
My current workaround is UGLY. I use iconv to find the problem, hand-edit the file in vi, and then process it, repeating until the file is clean. But I have MILLIONS of lines in many files to process. The URLs mostly do work after I hand-correct them, so these are not noise or "flipped bits".
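The manual iconv-and-vi search could be automated by reading the file in binary and trying to decode each line. The following is only a sketch under assumptions: find_undecodable_lines is a hypothetical helper name, and the in-memory sample stands in for the real file, which would be opened with open(path, "rb").

```python
import io


def find_undecodable_lines(binary_stream, encoding="utf-8"):
    """Yield (line_number, raw_bytes) for every line that fails to decode."""
    for line_number, raw in enumerate(binary_stream, start=1):
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            yield line_number, raw


# Hypothetical sample data; the second line contains the byte 0x96.
sample = io.BytesIO(
    b"http://example.com/ok.jpg\n"
    b"http://example.com/b\x96d.jpg\n"
)
bad_lines = list(find_undecodable_lines(sample))
print(bad_lines)
```

This reports which lines are affected without modifying anything, so the decision about how to decode them can be made afterwards.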
Answering my own question to let you all know what worked. Yes, opening in binary worked. I tried it, but then I don't have a "text" file. I read up on encodings and found that the following works, because every byte value is valid in Latin-1, so decoding can never fail. It is the safest thing to do.
urllist_file = open('../fall11_urls.txt', 'r', encoding="latin-1")
It seems that anyone opening text files that come from other people, with no way to control or know in advance what is inside, might be advised to use "latin-1", because there are no invalid byte values in Latin-1.
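Here is a minimal sketch of the line counter with the Latin-1 fix applied. The sample bytes and the temporary file are hypothetical stand-ins for fall11_urls.txt, and newline="\n" is an extra assumption added here so that any stray carriage returns are passed through untouched.

```python
import os
import tempfile

# Hypothetical sample standing in for the real URL list; line 2 contains 0x96.
sample = b"http://example.com/a.jpg\nhttp://example.com/b\x96c.jpg\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

lines_read = 0
urls = []
# Latin-1 maps every byte 0x00-0xFF to a character, so decoding cannot fail.
with open(path, "r", encoding="latin-1", newline="\n") as urllist_file:
    for line in urllist_file:
        lines_read += 1
        urls.append(line.rstrip("\n"))

os.remove(path)

print("line count:", lines_read)

# Encoding back with Latin-1 recovers the original bytes exactly.
round_trip = urls[1].encode("latin-1")
```

Because each byte becomes the character with the same code point, nothing is lost: the original bytes can always be recovered with encode("latin-1"), even if the non-ASCII characters are not the ones the file's author intended.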
Thanks. The suggestion to open in binary got me to investigate what other parameters open() accepts. I'm new to Python and was astounded to find that strings are not just a list of bytes. (That is what 20+ years of working in C will condition you to expect.)
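For anyone else coming from C, a small sketch of the bytes/str distinction may help. It also shows one possible variation on the fix: decoding each binary line as UTF-8 and falling back to Latin-1 only when that fails. decode_line is a hypothetical helper and the sample lines are made up for illustration.

```python
# bytes and str are different types in Python 3.
raw = b"abc"
text = raw.decode("latin-1")
first_byte = raw[0]    # indexing bytes gives an integer
first_char = text[0]   # indexing str gives a one-character string


def decode_line(raw_line: bytes) -> str:
    """Decode as UTF-8 when possible, otherwise fall back to Latin-1."""
    try:
        return raw_line.decode("utf-8")
    except UnicodeDecodeError:
        return raw_line.decode("latin-1")


# Hypothetical sample: one valid UTF-8 line and one containing 0x96.
sample_lines = [
    b"http://example.com/caf\xc3\xa9.jpg\n",
    b"http://example.com/b\x96c.jpg\n",
]
decoded = [decode_line(line).rstrip("\n") for line in sample_lines]
print(decoded)
```

The trade-off is that the fallback keeps correctly encoded UTF-8 lines readable, at the cost of mixing two decodings in one result, so encoding everything back with a single codec no longer reproduces the original file.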