
Fixing faulty unicode strings

Tags: python, unicode

A faulty unicode string is one whose characters are really the bytes of an encoded string, each byte stored as a separate code point. For example:

Text: שלום, Windows-1255-encoded: \x99\x8c\x85\x8d, Unicode: u'\u05e9\u05dc\u05d5\u05dd', Faulty Unicode: u'\x99\x8c\x85\x8d'

I sometimes bump into such strings when parsing ID3 tags in MP3 files. How can I fix these strings? (e.g. convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd')

asked Dec 29 '12 14:12 by iTayb



1 Answer

You could convert u'\x99\x8c\x85\x8d' to '\x99\x8c\x85\x8d' using the latin-1 encoding:

In [9]: x = u'\x99\x8c\x85\x8d'

In [10]: x.encode('latin-1')
Out[10]: '\x99\x8c\x85\x8d'
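
The latin-1 step works because latin-1 maps each code point from U+0000 to U+00FF to the single byte with the same value, so no information is lost. The session above is Python 2; a minimal Python 3 sketch of the same property (variable names are only illustrative) might look like this:

```python
# Python 3 sketch: latin-1 maps code points 0..255 to identical byte values.
faulty = '\x99\x8c\x85\x8d'      # text whose "characters" are really bytes
raw = faulty.encode('latin-1')   # recover the original byte values

print(raw)                                     # b'\x99\x8c\x85\x8d'
print([ord(c) for c in faulty] == list(raw))   # True

# Every possible byte value survives a decode/encode round trip unchanged.
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes
```

In Python 3 the result is a bytes object rather than a str, which is why it prints with a b prefix.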

However, '\x99\x8c\x85\x8d' does not seem to be a valid Windows-1255-encoded string. Did you perhaps mean '\xf9\xec\xe5\xed'? If so, then

In [22]: x = u'\xf9\xec\xe5\xed'

In [23]: x.encode('latin-1').decode('cp1255')
Out[23]: u'\u05e9\u05dc\u05d5\u05dd'

converts u'\xf9\xec\xe5\xed' to u'\u05e9\u05dc\u05d5\u05dd' which matches the desired unicode you posted.


If you really want to convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd', then this happens to work:

In [27]: u'\x99\x8c\x85\x8d'.encode('latin-1').decode('cp862')
Out[27]: u'\u05e9\u05dc\u05d5\u05dd'
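
cp862 is the DOS Hebrew code page, which would explain why these bytes decode to Hebrew letters. If you need to repair many strings, the two steps could be wrapped in a small helper. This is only a sketch for Python 3: the function name fix_faulty_text and its parameters are made up for illustration, and it assumes you already know which encoding the bytes were really in.

```python
def fix_faulty_text(text, byte_view='latin-1', real_encoding='cp862'):
    """Re-interpret text whose code points are really encoded bytes.

    Returns the input unchanged when it cannot be turned back into bytes
    or when those bytes are not valid in real_encoding.
    """
    try:
        return text.encode(byte_view).decode(real_encoding)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# ascii() is used so the output shows escapes, as in the sessions above.
print(ascii(fix_faulty_text('\x99\x8c\x85\x8d')))
# '\u05e9\u05dc\u05d5\u05dd'
print(ascii(fix_faulty_text('\xf9\xec\xe5\xed', real_encoding='cp1255')))
# '\u05e9\u05dc\u05d5\u05dd'
print(ascii(fix_faulty_text('\u05e9\u05dc\u05d5\u05dd')))
# '\u05e9\u05dc\u05d5\u05dd' (already correct, so returned unchanged)
```

Note that the helper cannot tell a faulty string from genuine text made of characters below U+0100, so it should only be applied to strings you know are faulty.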

The above encoding/decoding chain was found using this script:

guess_chain_encodings.py

"""
Usage example: guess_chain_encodings.py "u'баба'" "u'\xe1\xe0\xe1\xe0'"
"""
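import six
import argparse
import binascii
import zlib
import ast
import collections
import itertools
import random
from encodings.aliases import aliases

# The original script took its codec list from utils_string, a personal helper
# module that is not shown here. As a stand-in (an assumption), use every codec
# name the standard library has an alias for.
encodings = sorted(set(aliases.values()))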

Errors = (IOError, UnicodeEncodeError, UnicodeError, LookupError,
          TypeError, ValueError, binascii.Error, zlib.error)

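def breadth_first_search(text, find_all = False):
    # Explore encode/decode chains level by level, so shorter chains come first.
    seen = set()
    tasks = collections.deque()
    tasks.append(([], text))
    while tasks:
        encs, text = tasks.popleft()
        for enc, newtext in candidates(text):
            if repr(newtext) not in seen:
                # Unless every chain was requested, follow each distinct
                # intermediate result only once.
                if not find_all:
                    seen.add(repr(newtext))
                newtask = encs+[enc], newtext
                tasks.append(newtask)
                yield newtask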

def candidates(text):
    # Text can only be encoded and bytes can only be decoded.
    f = text.encode if isinstance(text, six.text_type) else text.decode
    results = []
    for enc in encodings:
        try:
            results.append((enc, f(enc)))
        except Errors:
            # This codec cannot handle the input; skip it.
            pass
    random.shuffle(results)
    for r in results:
        yield r

def fmt(encs, text):
    encode_decode = itertools.cycle(['encode', 'decode'])
    if not isinstance(text, six.text_type):
        next(encode_decode)
    chain = '.'.join( "{f}('{e}')".format(f = func, e = enc)
                     for enc, func in zip(encs, encode_decode) )
    return '{t!r}.{c}'.format(t = text, c = chain)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('start', type = ast.literal_eval, help = 'starting unicode')
    parser.add_argument('stop', type = ast.literal_eval, help = 'ending unicode')
    parser.add_argument('--all', '-a', action = 'store_true')
    args = parser.parse_args()
    min_len = None
    for encs, text in breadth_first_search(args.start, args.all):
        if min_len is not None and len(encs) > min_len:
            break
        if type(text) == type(args.stop) and text == args.stop:
            print(fmt(encs, args.start))
            min_len = len(encs)

if __name__ == '__main__':
    main()

Running

% guess_chain_encodings.py "u'\x99\x8c\x85\x8d'" "u'\u05e9\u05dc\u05d5\u05dd'" --all

yields

u'\x99\x8c\x85\x8d'.encode('latin_1').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('charmap').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('rot_13').decode('cp856')

etc.
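
If you only need the common case (the bytes are known and a single decode is missing), a much smaller search may be enough. The following is a rough Python 3 sketch using only the standard library. The function name decodings_that_yield is made up for illustration, and the codec list comes from encodings.aliases, which is an assumption on my part and may miss codecs that have no alias.

```python
from encodings.aliases import aliases


def decodings_that_yield(raw, expected):
    """Return the sorted codec names that decode raw to exactly expected."""
    matches = []
    for codec in sorted(set(aliases.values())):
        try:
            if raw.decode(codec) == expected:
                matches.append(codec)
        except Exception:
            # Probing arbitrary codecs: the codec may be unavailable on this
            # platform, may not be a text codec, or may reject these bytes.
            continue
    return matches


raw = '\x99\x8c\x85\x8d'.encode('latin-1')
found = decodings_that_yield(raw, '\u05e9\u05dc\u05d5\u05dd')
print(found)  # the list should include 'cp862'
```

Unlike the breadth-first script, this checks only one decode step, so it will not find longer chains.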

answered Nov 15 '22 08:11 by unutbu