I have a text file which the publisher (the US Securities and Exchange Commission) asserts is encoded in UTF-8 (https://www.sec.gov/files/aqfs.pdf, section 4). I'm processing the lines with the following code:
import codecs

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with codecs.open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = f.readline().strip().split('\t')
        for line in f.readlines():
            yield process_tag_record(fields, line)
I receive the following error:
Traceback (most recent call last):
  File "/home/randm/Projects/finance/secxbrl.py", line 151, in <module>
    main()
  File "/home/randm/Projects/finance/secxbrl.py", line 143, in main
    all_tags = list(tags("tag.txt"))
  File "/home/randm/Projects/finance/secxbrl.py", line 109, in tags
    content = f.read()
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 698, in read
    return self.reader.read(size)
  File "/home/randm/Libraries/anaconda3/lib/python3.6/codecs.py", line 501, in read
    newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
Given that I probably can't go back to the SEC and tell them they have files that don't seem to be encoded in UTF-8, how should I debug and catch this error?
What I have tried
I did a hexdump of the file and found that the offending text was the text "SUPPLEMENTAL DISCLOSURE OF NON�CASH INVESTING". If I decode the offending byte as a hex code point (i.e. "U+00AD"), it makes sense in context as it is the soft hyphen. But the following does not seem to work:
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> b"\x41".decode("utf-8")
'A'
>>> b"\xad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 0: invalid start byte
>>> b"\xc2ad".decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0: invalid continuation byte
I've used errors='replace', which seems to pass. But I'd like to understand what will happen if I try to insert that into a database.
Hexdump:
0036ae40 31 09 09 09 09 53 55 50 50 4c 45 4d 45 4e 54 41 |1....SUPPLEMENTA|
0036ae50 4c 20 44 49 53 43 4c 4f 53 55 52 45 20 4f 46 20 |L DISCLOSURE OF |
0036ae60 4e 4f 4e ad 43 41 53 48 20 49 4e 56 45 53 54 49 |NON.CASH INVESTI|
0036ae70 4e 47 20 41 4e 44 20 46 49 4e 41 4e 43 49 4e 47 |NG AND FINANCING|
0036ae80 20 41 43 54 49 56 49 54 49 45 53 3a 09 0a 50 72 | ACTIVITIES:..Pr|
You have a corrupted data file. If that character really is meant to be a U+00AD SOFT HYPHEN, then you are missing a 0xC2 byte:
>>> '\u00ad'.encode('utf8')
b'\xc2\xad'
Of all the possible UTF-8 encodings that end in 0xAD, a soft hyphen does make the most sense. However, it is indicative of a data set that may have other bytes missing. You just happened to have hit one that matters.
I'd go back to the source of this dataset and verify that the file was not corrupted when downloaded. Otherwise, using errors='replace' is a viable work-around, provided no delimiters (tabs, newlines, etc.) are missing.
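To make the effect of errors='replace' concrete, here is a small sketch on an in-memory sample. The byte string is copied from the hexdump in the question; nothing here reads the real file.

```python
# Sample bytes from the hexdump: a lone 0xAD between "NON" and "CASH".
sample = b"NON\xadCASH INVESTING"

# errors='replace' substitutes U+FFFD REPLACEMENT CHARACTER
# for each byte that cannot be decoded.
replaced = sample.decode("utf-8", errors="replace")
print(ascii(replaced))           # 'NON\ufffdCASH INVESTING'

# The original byte is gone: re-encoding gives the UTF-8 form of U+FFFD.
print(replaced.encode("utf-8"))  # b'NON\xef\xbf\xbdCASH INVESTING'
```

Assuming the database stores the string exactly as given, it would hold U+FFFD rather than a soft hyphen, and the original byte could not be recovered from the decoded text.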
Another possibility is that the SEC is really using a different encoding for the file; for example in Windows Codepage 1252 and Latin-1, 0xAD is the correct encoding of a soft hyphen. And indeed, when I download the same dataset directly (warning, large ZIP file linked), and open tag.txt, I can't decode the data as UTF-8:
>>> open('/tmp/2017q1/tag.txt', encoding='utf8').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 3583587: invalid start byte
>>> from pprint import pprint
>>> f = open('/tmp/2017q1/tag.txt', 'rb')
>>> f.seek(3583550)
3583550
>>> pprint(f.read(100))
(b'1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH INVESTING AND FINANCING A'
 b'CTIVITIES:\t\nProceedsFromSaleOfIn')
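To answer the "how should I debug this" part more generally, the offsets of all undecodable bytes can be collected from the UnicodeDecodeError attributes instead of stopping at the first one. This is a minimal sketch; the helper name invalid_utf8_offsets is made up for illustration, and it re-slices the data after each error, so it is only meant for files with a handful of bad bytes.

```python
def invalid_utf8_offsets(data):
    """Return the absolute offsets where UTF-8 decoding of data fails."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf-8')
            break  # the rest of the data decodes cleanly
        except UnicodeDecodeError as exc:
            # exc.start and exc.end are relative to the slice we decoded
            offsets.append(pos + exc.start)
            pos += exc.end  # resume right after the offending bytes
    return offsets

# Illustrative sample containing the two problem bytes from this file
sample = b'NON\xadCASH h\xf6he'
print(invalid_utf8_offsets(sample))  # [3, 10]
```

For the real file you would pass in the bytes read in binary mode, for example open(filename, 'rb').read().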
There are two such non-ASCII characters in the file:
>>> f.seek(0)
0
>>> pprint([l for l in f if any(b > 127 for b in l)])
[b'SupplementalDisclosureOfNoncashInvestingAndFinancingActivitiesAbstract\t0'
 b'001654954-17-000551\t1\t1\t\t\t\tSUPPLEMENTAL DISCLOSURE OF NON\xadCASH I'
 b'NVESTING AND FINANCING ACTIVITIES:\t\n',
 b'HotelKranichhheMember\t0001558370-17-001446\t1\t0\tmember\tD\t\tHotel Krani'
 b'chhhe [Member]\tRepresents information pertaining to Hotel Kranichh\xf6h'
 b'e.\n']
Hotel Kranichh\xf6he decoded as Latin-1 is Hotel Kranichhöhe.
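As a quick check of that reading, the sketch below decodes both problem byte strings with Latin-1 and with Windows codepage 1252. The samples are short excerpts copied from the output above, not the full lines.

```python
samples = [b'NON\xadCASH', b'Hotel Kranichh\xf6he']

# Decode each sample with both candidate codecs.
decoded = {
    codec: [raw.decode(codec) for raw in samples]
    for codec in ('latin-1', 'cp1252')
}
for codec, texts in decoded.items():
    print(codec, [ascii(t) for t in texts])

# The two codecs only disagree in the 0x80-0x9F range; for example,
# 0x93 is a C1 control character in Latin-1 but U+201C in cp1252.
print(ascii(b'\x93'.decode('latin-1')), ascii(b'\x93'.decode('cp1252')))
```

Both codecs give the same result for 0xAD and 0xF6, so these two bytes alone cannot tell you which of the two the producer used. Latin-1 has the practical advantage that it maps every byte to a code point, so decoding never raises.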
There are also several 0x1C / 0x1D pairs in the file:
>>> f.seek(0)
0
>>> quotes = [l for l in f if any(b in {0x1C, 0x1D} for b in l)]
>>> quotes[0].split(b'\t')[-1][50:130]
b'Temporary Payroll Tax Cut Continuation Act of 2011 (\x1cTCCA\x1d) recognized during th'
>>> quotes[1].split(b'\t')[-1][50:130]
b'ributory defined benefit pension plan (the \x1cAetna Pension Plan\x1d) to allow certai'
I'm betting those are really U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK characters; note the 1C and 1D parts. It almost feels as if their encoder took UTF-16 and stripped out all the high bytes, rather than encode to UTF-8 properly!
There is no codec shipping with Python that would encode '\u201C\u201D' to b'\x1C\x1D', making it all the more likely that the SEC has botched their encoding process somewhere. In fact, there are also 0x13 and 0x14 characters that are probably en and em dashes (U+2013 and U+2014), as well as 0x19 bytes that are almost certainly single quotes (U+2019). All that is missing to complete the picture is a 0x18 byte to represent U+2018.
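The "stripped high bytes" guess is easy to check. The sketch below encodes each suspected character as big-endian UTF-16 and keeps only the low byte; the helper name low_byte is made up for illustration. This shows that the hypothesis is consistent with the bytes seen in the file, not how the SEC actually produced it.

```python
import unicodedata

def low_byte(char):
    """Return the low-order byte of a BMP character's UTF-16 code unit."""
    high, low = char.encode('utf-16-be')  # two bytes: high, then low
    return low

suspects = ['\u2013', '\u2014', '\u2018', '\u2019', '\u201c', '\u201d']
for char in suspects:
    print('U+{:04X} {} -> 0x{:02X}'.format(
        ord(char), unicodedata.name(char), low_byte(char)))

# Characters in the Latin-1 range have a zero high byte, so they
# survive unchanged, which matches the 0xAD and 0xF6 bytes.
print(hex(low_byte('\u00ad')), hex(low_byte('\u00f6')))
```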
If we assume that the encoding is broken, we can attempt to repair. The following code would read the file and fix the quotes issues, assuming that the rest of the data does not use characters outside of Latin-1 apart from the quotes:
_map = {
    # dashes
    0x13: '\u2013', 0x14: '\u2014',
    # single quotes
    0x18: '\u2018', 0x19: '\u2019',
    # double quotes
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)
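Then apply that to the lines you read:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    # Latin-1 maps every byte to a code point, so decoding cannot fail
    with open(filename, 'r', encoding='latin-1') as f:
        repaired = map(repair, f)
        fields = next(repaired).strip().split('\t')
        for line in repaired:
            yield process_tag_record(fields, line)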
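To see the repair step in isolation, here is a self-contained sketch that runs it on a short excerpt of one of the lines shown earlier. The mapping is repeated only so that the snippet runs on its own.

```python
# Same mapping as above, repeated so this snippet is runnable by itself.
_map = {
    0x13: '\u2013', 0x14: '\u2014',  # dashes
    0x18: '\u2018', 0x19: '\u2019',  # single quotes
    0x1c: '\u201c', 0x1d: '\u201d',  # double quotes
}

def repair(line, _map=_map):
    """Repair mis-encoded SEC data. Assumes line was decoded as Latin-1"""
    return line.translate(_map)

raw = b'Act of 2011 (\x1cTCCA\x1d) recognized'
fixed = repair(raw.decode('latin-1'))
print(fixed)  # Act of 2011 (“TCCA”) recognized

# Tabs and newlines are not in the map, so delimiters are left alone.
print(repr(repair('a\tb\n')))  # 'a\tb\n'
```

Note that this rewrites every 0x13, 0x14, 0x18, 0x19, 0x1C and 0x1D character, so it relies on the assumption that the file contains no genuine control characters with those values.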
Separately, addressing your posted code, you are making Python work harder than it needs to. Don't use codecs.open(); that's legacy code that has known issues and is slower than the newer Python 3 I/O layer. Just use open(). Do not use f.readlines(); you don't need to read the whole file into a list here. Just iterate over the file directly:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        fields = next(f).strip().split('\t')
        for line in f:
            yield process_tag_record(fields, line)
If process_tag_record also splits on tabs, use a csv.reader() object and avoid splitting each row manually:
import csv

def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.reader(f, delimiter='\t')
        fields = next(reader)
        for row in reader:
            yield process_tag_record(fields, row)
If process_tag_record combines the fields list with the values in row to form a dictionary, just use csv.DictReader() instead:
def tags(filename):
    """Yield Tag instances from tag.txt."""
    with open(filename, 'r', encoding='utf-8', errors='strict') as f:
        reader = csv.DictReader(f, delimiter='\t')
        # first row is used as keys for the dictionary, no need to read fields manually.
        yield from reader
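Since this particular file does not decode as UTF-8, the DictReader version can be combined with the Latin-1 repair step. This is a minimal sketch on a made-up two-column sample; the function name repaired_rows and the column names tag and doc are illustrative assumptions, not necessarily the real header.

```python
import csv

# Control characters that are assumed to be mangled dashes and quotes.
_map = {
    0x13: '\u2013', 0x14: '\u2014',
    0x18: '\u2018', 0x19: '\u2019',
    0x1c: '\u201c', 0x1d: '\u201d',
}

def repaired_rows(lines):
    """Yield one dict per record from tab-separated, Latin-1 decoded lines."""
    fixed = (line.translate(_map) for line in lines)
    # csv.DictReader accepts any iterable of strings, not only file objects.
    yield from csv.DictReader(fixed, delimiter='\t')

# Hypothetical sample standing in for lines read from the file.
sample = ['tag\tdoc\n', 'Foo\tthe \x1cAetna Pension Plan\x1d\n']
rows = list(repaired_rows(sample))
print(rows[0]['doc'])  # the “Aetna Pension Plan”
```

With the real file you would pass in a file object opened with encoding='latin-1' in place of the sample list; the csv module documentation also recommends opening such files with newline=''.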