Python - dealing with mixed-encoding files

I have a file which is mostly UTF-8, but some Windows-1252 characters have also found their way in.

I created a table to map from the Windows-1252 (cp1252) characters to their Unicode counterparts, and would like to use it to fix the mis-encoded characters, e.g.

cp1252_to_unicode = {
    "\x85": u'\u2026', # …
    "\x91": u'\u2018', # ‘
    "\x92": u'\u2019', # ’
    "\x93": u'\u201c', # “
    "\x94": u'\u201d', # ”
    "\x97": u'\u2014'  # —
}

for l in open('file.txt'):
    for c, u in cp1252_to_unicode.items():
        l = l.replace(c, u)

But attempting to do the replace this way results in a UnicodeDecodeError being raised, e.g.:

"\x85".replace("\x85", u'\u2026')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x85 in position 0: ordinal not in range(128)

Any ideas for how to deal with this?

asked Apr 04 '12 by Keith Hughitt

2 Answers

If you try to decode this string as utf-8, as you already know, you will get a UnicodeDecodeError, because these spurious cp1252 characters are invalid utf-8.

However, Python codecs allow you to register a callback to handle encoding/decoding errors, with the codecs.register_error function. It gets the UnicodeDecodeError as a parameter, so you can write a handler that attempts to decode the offending data as "cp1252" and then continues decoding the rest of the string as utf-8.

In my utf-8 terminal, I can build a mixed incorrect string like this:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> print a
maçã ma�� 
>>> a.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data

I wrote the said callback function here, and found a catch: even if you increment the position from which to decode the string by 1, so that it would start on the next character, if that next character is also not utf-8 and out of range(128), the error is raised again at the first out-of-range(128) character. That means the decoding "walks back" if consecutive non-ascii, non-utf-8 bytes are found.

The workaround for this is to have a state variable in the error handler which detects this "walking back" and resumes decoding from the position of the last call to it. In this short example I implemented it as a global variable (it will have to be manually reset to -1 before each call to the decoder):

import codecs

last_position = -1

def mixed_decoder(unicode_error):
    global last_position
    string = unicode_error[1]          # the byte string being decoded
    position = unicode_error.start     # index of the offending byte
    if position <= last_position:
        # the decoder "walked back"; resume just after the last byte we fixed
        position = last_position + 1
    last_position = position
    new_char = string[position].decode("cp1252")
    #new_char = u"_"                   # use this instead to blank out bad bytes
    return new_char, position + 1      # replacement text and where to resume

codecs.register_error("mixed", mixed_decoder)

And on the console:

>>> a = u"maçã ".encode("utf-8") + u"maçã ".encode("cp1252")
>>> last_position = -1
>>> print a.decode("utf-8", "mixed")
maçã maçã 
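
A side note for Python 3, as a minimal sketch rather than a drop-in solution: there the same codecs.register_error approach appears to work without the global position bookkeeping, because the handler receives the raw bytes in err.object and the exact invalid range in err.start/err.end, and decoding resumes from whatever position the handler returns. The handler name below is just for illustration:

import codecs

def mixed_decoder_py3(err):
    # err.object is the bytes being decoded; err.start:err.end is the invalid span.
    # Re-decode just that span as cp1252 and resume right after it.
    bad_bytes = err.object[err.start:err.end]
    return bad_bytes.decode("cp1252"), err.end

codecs.register_error("mixed", mixed_decoder_py3)

data = "maçã ".encode("utf-8") + "maçã ".encode("cp1252")
print(data.decode("utf-8", "mixed"))   # maçã maçã

Note that cp1252 leaves a few byte values (0x81, 0x8D, 0x8F, 0x90 and 0x9D) undefined, so the inner decode can still raise on arbitrary garbage; latin-1 would be a never-failing fallback if that is a concern.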
answered Sep 20 '22 by jsbueno


With thanks to jsbueno, plus a whack of other Google searches and other pounding, I solved it this way.

#The following works very well but it does not allow for any attempts to FIX the data.
xmlText = unicode(xmlText, errors='replace').replace(u"\uFFFD", "?")

This version allows for a limited opportunity to repair invalid characters. Unknown characters are replaced with a safe value.

import codecs    
replacement = {
   '85' : '...',           # u'\u2026' ... character.
   '96' : '-',             # u'\u2013' en-dash
   '97' : '-',             # u'\u2014' em-dash
   '91' : "'",             # u'\u2018' left single quote
   '92' : "'",             # u'\u2019' right single quote
   '93' : '"',             # u'\u201C' left double quote
   '94' : '"',             # u'\u201D' right double quote
   '95' : "*"              # u'\u2022' bullet
}

#This is more complex but allows for the data to be fixed.
def mixed_decoder(unicodeError):
    errStr = unicodeError[1]                      # the byte string being decoded
    errLen = unicodeError.end - unicodeError.start
    nextPosition = unicodeError.start + errLen    # resume decoding after the bad bytes
    errHex = errStr[unicodeError.start:unicodeError.end].encode('hex')   # e.g. '85'
    if errHex in replacement:
        return u'%s' % replacement[errHex], nextPosition
    return u'%s' % errHex, nextPosition   # Comment this line out to get a question mark
    return u'?', nextPosition

codecs.register_error("mixed", mixed_decoder)

xmlText = xmlText.decode("utf-8", "mixed")

Basically I attempt to decode the text as utf-8. Any character that fails is just converted to hex so I can display it or look it up in a table of my own.

This is not pretty, but it does allow me to make sense of messed-up data.
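
A final note that goes beyond the answer above: once a handler has been registered with codecs.register_error, its name can be passed anywhere an errors argument is accepted, so the file.txt from the question can be read line by line through it. A rough sketch, assuming a handler like the one above has already been registered as "mixed":

import codecs

# Assumes codecs.register_error("mixed", mixed_decoder) has already run (see above).
cleaned_lines = []
with codecs.open('file.txt', encoding='utf-8', errors='mixed') as f:
    for line in f:
        # each line comes back as a unicode string with the stray cp1252
        # bytes already mapped through the handler
        cleaned_lines.append(line)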

answered Sep 21 '22 by AnthonyVO