This is my first time using StackOverflow to ask a question, but you've collectively saved so many of my projects over the years that I feel at home already.
I'm using Python3.5 and nltk to parse the Complete Corpus of Old English, which was published to me as 77 text files and an XML doc that designates the file sequence as contiguous segments of a TEI-formatted corpus. Here's the relevant part of the header from the XML doc showing that we are, in fact, working with TEI:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader type="ISBD-ER">
<fileDesc>
Right, so as a test, I'm just trying to use NLTK's MTECorpusReader to open the corpus and use the words() method to prove that I'm able to open it. I'm doing all of this from the interactive Python shell, just for ease of testing. Here's all I'm really doing:
# import the corpus reader package
import nltk.corpus.reader as reader
# open the sequence of files and the XML doc with the MTECorpusReader
oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*')
# print the first few words in the corpus to the interactive shell
oecorpus.words()
When I try that, I get the following traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/util.py", line 765, in __repr__
for elt in self:
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 397, in iterate_from
for tok in piece.iterate_from(max(0, start_tok-offset)):
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/util.py", line 291, in iterate_from
tokens = self.read_block(self._stream)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/mte.py", line 25, in read_block
return list(filter(lambda x: x is not None, XMLCorpusView.read_block(self, stream, tagspec, elt_handler)))
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 307, in read_block
xml_fragment = self._read_xml_fragment(stream)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/corpus/reader/xmldocs.py", line 252, in _read_xml_fragment
xml_block = stream.read(self._BLOCK_SIZE)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1097, in read
chars = self._read(size)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1367, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/nltk/data.py", line 1398, in _incr_decode
return self.decode(bytes, 'strict')
File "/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 59: invalid start byte
So, as I'm a valiant StackOverflowsketeer, I've determined that either one or more of the files is corrupted, or the file(s) contain some byte that Python's utf-8 decoder doesn't know how to handle. I can be fairly certain of this file's integrity (take my word for it), so I'm pursuing the latter.
I tried the following to reformat the 77 text files with no apparent effect:
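import os
# collect every regular file in the current directory
loglist = [name for name in os.listdir('.') if os.path.isfile(name)]
for file in loglist:
    # open each file with replacement of undecodable bytes, then close it
    bufferfile = open(file, encoding='utf-8', errors='replace')
    bufferfile.close()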
So my questions are:
1) Does my approach so far make sense, or have I screwed something up in my troubleshooting so far?
2) Is it fair to conclude at this point that the issue must be with the XML doc, based on the fact that the UTF-8 error shows up very early (at position 59) and the fact that my utf-8 error replacement script made no difference to the problem? If I'm wrong to assume that, then how can I better isolate the issue?
3) If we can conclude that the issue is with the XML doc, what's the best way to clear it up? Is it feasible for me to try to find that hex byte and the ASCII it corresponds to and change the character?
Thank you in advance for your help!
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)
UTF-8 supports any Unicode character, which pragmatically means any natural language (Coptic, Sinhala, Phoenician, Cherokee, etc.), as well as many non-spoken languages (music notation, mathematical symbols, APL). The stated objective of the Unicode Consortium is to encompass all communications.
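As a small illustration of how UTF-8 represents characters as 8-bit values, the sketch below encodes a few letters that turn up in Old English text and checks whether a byte has the continuation pattern `10xxxxxx`. The helper name `is_continuation_byte` is made up for this example.

```python
# Encode a few sample characters and look at the bytes UTF-8 produces.
samples = ['a', 'þ', 'æ', 'ƿ']
encoded = {ch: ch.encode('utf-8') for ch in samples}

for ch in samples:
    print(repr(ch), [hex(b) for b in encoded[ch]])


def is_continuation_byte(b):
    """Return True if the byte has the form 10xxxxxx.

    Bytes of this form may only follow a lead byte in UTF-8;
    they can never begin a character.
    """
    return (b & 0b11000000) == 0b10000000


print(is_continuation_byte(0x80))  # the byte named in the traceback
```

ASCII letters take one byte, while the other letters here take two. A byte such as 0x80 fits the continuation pattern, which is consistent with the "invalid start byte" wording in the traceback.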
Your conversion technique didn't work because it only opens and closes each file; it never reads the contents and writes them back out again.
0x80 is not a valid start byte in UTF-8, and it is not a printable character in any iso-8859-* character set. It is valid in Windows codepages, but only Unicode can support Old English characters, so you have some very broken data.
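To isolate which file actually contains the bad byte, one option is to read each file as raw bytes and attempt a strict decode. This is only a sketch: `find_bad_utf8` and `scan_directory` are hypothetical helper names, not part of NLTK, and the directory in the commented-out call is the one from the question.

```python
import os


def find_bad_utf8(path):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected
        return err.start, data[err.start]
    return None


def scan_directory(root):
    """Map each file name in root to its first bad byte, skipping clean files."""
    results = {}
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            bad = find_bad_utf8(path)
            if bad is not None:
                results[name] = bad
    return results


# Example call (path taken from the question):
# for name, (offset, value) in scan_directory('/Users/me/Documents/0163').items():
#     print(name, offset, hex(value))
```

Any file that reports offset 59 and value 0x80 would be the one triggering the traceback, whether that turns out to be the XML doc or one of the text files.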
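To convert UTF-8 with bad bytes (the undecodable bytes are silently dropped), do:
# read with errors='ignore', then write the cleaned text to a new file
with open('input.txt', 'r', encoding='utf-8', errors='ignore') as infile, \
        open('output.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(infile.read())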
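The same read-then-write step could be applied to the whole set of files. The following is a rough sketch, assuming the cleaned copies go into a separate directory so the originals stay untouched; `clean_utf8_files` and the directory arguments are hypothetical names for illustration.

```python
import os


def clean_utf8_files(src_dir, dst_dir, errors='replace'):
    """Copy every file in src_dir to dst_dir, re-encoding it as clean UTF-8.

    With errors='replace', each undecodable byte becomes U+FFFD so the
    damage stays visible; pass errors='ignore' to drop such bytes instead.
    """
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(src_dir)):
        src = os.path.join(src_dir, name)
        if not os.path.isfile(src):
            continue
        # newline='' keeps the original line endings unchanged
        with open(src, 'r', encoding='utf-8', errors=errors, newline='') as infile:
            text = infile.read()
        with open(os.path.join(dst_dir, name), 'w', encoding='utf-8', newline='') as outfile:
            outfile.write(text)
        written.append(name)
    return written
```

The corpus reader would then be pointed at the output directory rather than the original one.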
If you don't care about losing data, you may get away with using the encoding argument on MTECorpusReader:
oecorpus = reader.mte.MTECorpusReader('/Users/me/Documents/0163','.*', encoding='cp1252')
which will make 0x80 a Euro (€) symbol.
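A quick way to see how the same byte behaves under different codecs, using only the standard library:

```python
raw = b'\x80'

# Windows-1252 maps 0x80 to the Euro sign.
as_cp1252 = raw.decode('cp1252')

# Latin-1 maps it to the C1 control character U+0080.
as_latin1 = raw.decode('latin-1')

# A strict UTF-8 decode rejects it, as in the traceback.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_cp1252), repr(as_latin1), utf8_ok)
```

Decoding as cp1252 avoids the exception, but any byte that was never meant to be cp1252 will come out as the wrong character.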