
Python 3.5 UnicodeDecodeError for a file in utf-8 (language is 'ang', Old English)

This is my first time using StackOverflow to ask a question, but you've collectively saved so many of my projects over the years that I feel at home already.

I'm using Python 3.5 and nltk to parse the Complete Corpus of Old English, which was published to me as 77 text files plus an XML doc that designates the file sequence as contiguous segments of a TEI-formatted corpus. Here's the relevant part of the header from the XML doc showing that we are, in fact, working with TEI:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader type="ISBD-ER">
    <fileDesc>

Right, so as a test, I'm just trying to use NLTK's MTECorpusReader to open the corpus and call its words() method to prove that I'm able to read it. I'm doing all of this from the interactive Python shell, just for ease of testing. Here's all I'm really doing:

# import the reader method    
import nltk.corpus.reader as reader

# open the sequence of files and the XML doc with the MTECorpusReader    
oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*')

# print the first few words in the corpus to the interactive shell
oecorpus.words()

When I try that, I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
    return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
    xml_fragment = self._read_xml_fragment(stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
    xml_block = stream.read(self._BLOCK_SIZE)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
    chars = self._read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte

So, as a valiant StackOverflowsketeer, I've determined that either one or more files is corrupted, or the file(s) contain a character that Python's UTF-8 decoder doesn't know how to handle. I can be fairly certain of the files' integrity (take my word for it), so I'm pursuing the second possibility.

I tried the following to reformat the 77 text files with no apparent effect:

loglist = [name for name in os.listdir('.') if os.path.isfile(name)]
for file in loglist:
    bufferfile = open(file, encoding='utf-8', errors='replace')
    bufferfile.close()
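In case it helps, here's a quick scan I put together (just a sketch; it reads each file raw and reports the first byte the strict UTF-8 decoder rejects) to try to find which file actually trips the decoder:

```python
import os

def find_bad_utf8(directory='.'):
    """Return (filename, offset, bad byte) for each file that fails UTF-8 decoding."""
    problems = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'rb') as f:
            data = f.read()
        try:
            data.decode('utf-8')
        except UnicodeDecodeError as e:
            # e.start is the offset of the first byte the decoder rejected
            problems.append((name, e.start, data[e.start]))
    return problems
```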

So my questions are:

1) Does my approach make sense so far, or have I screwed something up in my troubleshooting?

2) Is it fair to conclude at this point that the issue must be with the XML doc, based on the fact that the UTF-8 error shows up very early (at byte position 59) and the fact that my UTF-8 error-replacement script made no difference? If I'm wrong to assume that, then how can I better isolate the issue?

3) If we can conclude that the issue is with the XML doc, what's the best way to clear it up? Is it feasible for me to try to find that byte, work out which character it was meant to be, and change it?

Thank you in advance for your help!

asked Jul 14 '16 by gatsbysghost

1 Answer

Your conversion technique didn't work because you never read and wrote the file back out again.

0x80 is not a valid byte in UTF-8 or any iso-8859-* character set. It is valid in Windows code pages, but only Unicode can represent the full range of Old English characters, so you have some very broken data.

To convert a file containing bad bytes to clean UTF-8, do:

with open('input.txt', 'r', encoding='utf-8', errors='ignore') as infile, \
        open('output.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(infile.read())
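Since the corpus is 77 files plus the XML doc, you can apply the same read-and-rewrite step to every file in a directory. Something like this should work (a sketch; it rewrites files in place, so back up the originals first):

```python
import os

def clean_utf8_in_place(directory):
    """Re-read every file with undecodable bytes dropped, then write it
    back out as clean UTF-8 (destructive: keep a backup of the originals)."""
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        with open(path, 'r', encoding='utf-8', errors='ignore') as src:
            text = src.read()
        with open(path, 'w', encoding='utf-8') as dst:
            dst.write(text)
```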

If you don't care about losing data, you may get away with using the encoding argument on MTECorpusReader:

oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*', encoding='cp1252')

which will make 0x80 a Euro (€) symbol.
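You can verify both claims directly in the interpreter:

```python
# 0x80 is rejected as a start byte by the strict UTF-8 decoder...
try:
    b'\x80'.decode('utf-8')
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
assert not decoded_ok

# ...but under Windows-1252 (cp1252) it maps to the Euro sign.
assert b'\x80'.decode('cp1252') == '\u20ac'
```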

answered Oct 15 '22 by Alastair McCormack