Python 3.5 UnicodeDecodeError for a file in utf-8 (language is 'ang', Old English)

This is my first time using StackOverflow to ask a question, but you've collectively saved so many of my projects over the years that I feel at home already.

I'm using Python 3.5 and NLTK to parse the Complete Corpus of Old English, which was published to me as 77 text files and an XML doc that designates the file sequence as contiguous segments of a TEI-formatted corpus. Here's the relevant part of the header from the XML doc showing that we are, in fact, working with TEI:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader type="ISBD-ER">
    <fileDesc>

Right, so as a test, I'm just trying to use NLTK's MTECorpusReader to open the corpus and use the words() method to prove that I'm able to open it. I'm doing all of this from the interactive Python shell, just for ease of testing. Here's all I'm really doing:

# import the corpus reader package
import nltk.corpus.reader as reader

# open the sequence of files and the XML doc with the MTECorpusReader
oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*')

# print the first few words in the corpus to the interactive shell
oecorpus.words()

When I try that, I get the following traceback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
    for elt in self:
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
    for tok in piece.iterate_from(max(0, start_tok-offset)):
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
    tokens = self.read_block(self._stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
    return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
    xml_fragment = self._read_xml_fragment(stream)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
    xml_block = stream.read(self._BLOCK_SIZE)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
    chars = self._read(size)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
    chars, bytes_decoded = self._incr_decode(bytes)
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
    return self.decode(bytes, 'strict')
  File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte

So, as I'm a valiant StackOverflowsketeer, I've determined that either one or more of the files is corrupted, or the file(s) contain some character that Python's utf-8 decoder doesn't know how to handle. I can be fairly certain of the files' integrity (take my word for it), so I'm pursuing the second possibility.

I tried the following to reformat the 77 text files with no apparent effect:

import os

# collect every regular file in the current directory
loglist = [name for name in os.listdir('.') if os.path.isfile(name)]

for file in loglist:
    bufferfile = open(file, encoding='utf-8', errors='replace')
    bufferfile.close()

So my questions are:

1) Does my approach so far make sense, or have I screwed something up in my troubleshooting?

2) Is it fair to conclude at this point that the issue must be with the XML doc, based on the fact that the UTF-8 error shows up very early (at position 59) and the fact that my utf-8 error replacement script made no difference to the problem? If I'm wrong to assume that, then how can I better isolate the issue?

3) If we can conclude that the issue is with the XML doc, what's the best way to clear it up? Is it feasible for me to try to find that hex byte and the ASCII it corresponds to and change the character?

Thank you in advance for your help!

asked Jul 14 '16 00:07

gatsbysghost

1 Answer

Your conversion technique didn't work because you only opened and closed each file; you never read the contents and wrote them back out again.

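0x80 is not a valid start byte in UTF-8 (it can only appear as a continuation byte inside a multi-byte sequence), and it is not a printable character in any iso-8859-* character set. It is valid in Windows codepages such as cp1252, but those cannot represent the full range of Old English characters the way Unicode can, so you have some very broken data.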
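To isolate where the bad byte actually lives, one option is to read each file as raw bytes and try to decode it yourself. The following is only a minimal sketch: the helper names `find_first_bad_byte` and `scan_directory` are made up for illustration, and it assumes all the files sit directly in one directory.

```python
import os


def find_first_bad_byte(path, encoding="utf-8"):
    """Return (offset, byte_value) of the first undecodable byte, or None."""
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the decimal offset of the offending byte.
        return err.start, raw[err.start]
    return None


def scan_directory(root, encoding="utf-8"):
    """Map each file name directly under root to its first bad byte, if any."""
    problems = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            bad = find_first_bad_byte(path, encoding)
            if bad is not None:
                problems[name] = bad
    return problems


# Hypothetical usage with the directory from the question:
# for name, (offset, value) in scan_directory('/Users/me/Documents/0163').items():
#     print(name, offset, hex(value))
```

Because the reader was given the pattern '.*', every file in that directory is read, including any stray non-text file, so a scan like this can also show whether the offending byte is in a corpus file at all.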

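To convert UTF-8 with bad bytes, read the file and write the result to a new one:

with open('input.txt', 'r', encoding='utf-8', errors='ignore') as infile, \
        open('output.txt', 'w', encoding='utf-8') as outfile:
    # undecodable bytes are dropped on read, so the output is valid UTF-8
    outfile.write(infile.read())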
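For a whole directory of files, the same read-then-write idea could be wrapped in a loop. This is a rough sketch rather than a tested recipe for this corpus: `clean_utf8_files` and its parameters are hypothetical names, and it writes into a separate directory so the originals stay untouched.

```python
import os


def clean_utf8_files(src_dir, dst_dir, errors="replace"):
    """Re-encode every file directly under src_dir as valid UTF-8 in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        # newline="" keeps the original line endings unchanged.
        with open(src, "r", encoding="utf-8", errors=errors, newline="") as infile:
            text = infile.read()
        dst = os.path.join(dst_dir, name)
        with open(dst, "w", encoding="utf-8", newline="") as outfile:
            outfile.write(text)
        written.append(name)
    return written
```

With errors="replace" each bad byte becomes U+FFFD, which leaves a visible marker in the text; errors="ignore" drops the byte silently instead.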

If you don't care about losing data, you may get away with using the encoding argument on MTECorpusReader:

oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*', encoding='cp1252')

which will make 0x80 a Euro (€) symbol.
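To see why this counts as losing data, it may help to compare how the two codecs treat the same bytes. The snippet below is just an illustration (`compare_decodings` is a made-up helper): cp1252 accepts the lone 0x80, but it also decodes every genuine multi-byte UTF-8 character one byte at a time, so a thorn comes out as two unrelated characters.

```python
def compare_decodings(raw):
    """Decode raw bytes as UTF-8 and as cp1252; None marks a failed decode."""
    results = {}
    for encoding in ("utf-8", "cp1252"):
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# A lone 0x80 fails as UTF-8 but becomes the Euro sign (U+20AC) in cp1252.
lone_byte = compare_decodings(b"\x80")

# Thorn (U+00FE) is two bytes in UTF-8; cp1252 turns them into U+00C3 U+00BE.
thorn = compare_decodings("\u00fe".encode("utf-8"))
```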

answered Oct 15 '22 07:10

Alastair McCormack

Tags:

python

python-3.x

utf-8

nltk
