Fixing faulty unicode strings

1 Answer

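You could convert the unicode string u'\x99\x8c\x85\x8d' to the byte string '\x99\x8c\x85\x8d' using the latin-1 encoding, which maps every code point below 256 to the byte with the same value: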

In [9]: x = u'\x99\x8c\x85\x8d'

In [10]: x.encode('latin-1')
Out[10]: '\x99\x8c\x85\x8d'

However, '\x99\x8c\x85\x8d' does not seem to be a valid Windows-1255 (cp1255) encoded string. Did you perhaps mean '\xf9\xec\xe5\xed'? If so, then

In [22]: x = u'\xf9\xec\xe5\xed'

In [23]: x.encode('latin-1').decode('cp1255')
Out[23]: u'\u05e9\u05dc\u05d5\u05dd'

converts u'\xf9\xec\xe5\xed' to u'\u05e9\u05dc\u05d5\u05dd' which matches the desired unicode you posted.
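The session above is Python 2. In Python 3, where every str is already unicode and encode returns bytes, the same repair could be sketched roughly as follows. The function name fix_mojibake and its parameter names are made up for this illustration and do not come from any library.

```python
def fix_mojibake(text, wrong='latin-1', right='cp1255'):
    """Undo a decode that was done with the wrong codec.

    Hypothetical helper: re-encode with the codec that was wrongly used
    (which recovers the original bytes), then decode with the intended one.
    """
    return text.encode(wrong).decode(right)


# ascii() keeps the output printable on terminals without Hebrew support.
print(ascii(fix_mojibake('\xf9\xec\xe5\xed')))  # '\u05e9\u05dc\u05d5\u05dd'
```

This only works when the wrong codec is lossless for the bytes involved, which is why latin-1 is the usual choice for the first step.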

If you really do want to convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd', then decoding the bytes as cp862 (the DOS Hebrew code page) happens to work:

In [27]: u'\x99\x8c\x85\x8d'.encode('latin-1').decode('cp862')
Out[27]: u'\u05e9\u05dc\u05d5\u05dd'
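One way to see why cp862 fits is to encode the desired word with both Hebrew code pages and compare the bytes. A small Python 3 check, with variable names chosen only for illustration:

```python
word = '\u05e9\u05dc\u05d5\u05dd'  # the desired text from the question

dos_bytes = word.encode('cp862')   # DOS Hebrew code page
win_bytes = word.encode('cp1255')  # Windows Hebrew code page

print(dos_bytes)  # b'\x99\x8c\x85\x8d'
print(win_bytes)  # b'\xf9\xec\xe5\xed'

# Reading the DOS bytes as latin-1 reproduces the faulty string.
faulty = dos_bytes.decode('latin-1')
print(ascii(faulty))  # '\x99\x8c\x85\x8d'
```

So the faulty string looks like cp862 bytes that were read as latin-1, which is presumably how it came about.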

The above encoding/decoding chain was found using this script:

guess_chain_encodings.py

"""
Usage example: guess_chain_encodings.py "u'баба'" "u'\xe1\xe0\xe1\xe0'"
"""
import six
import argparse
import binascii
import zlib
import utils_string as us  # not in the standard library; apparently the author's own helper module
import ast
import collections
import itertools
import random

encodings = us.all_encodings()  # names of the codecs to try at every step

Errors = (IOError, UnicodeEncodeError, UnicodeError, LookupError,
          TypeError, ValueError, binascii.Error, zlib.error)

def breadth_first_search(text, show_all=False):
    # Explore chains of alternating encode/decode steps, shortest chains first.
    seen = set()
    tasks = collections.deque()
    tasks.append(([], text))
    while tasks:
        encs, text = tasks.popleft()
        for enc, newtext in candidates(text):
            if repr(newtext) not in seen:
                # Unless every chain was requested, skip results reached before.
                if not show_all:
                    seen.add(repr(newtext))
                newtask = encs+[enc], newtext
                tasks.append(newtask)
                yield newtask

def candidates(text):
    # Encode unicode text; decode byte strings.
    f = text.encode if isinstance(text, six.text_type) else text.decode
    results = []
    for enc in encodings:
        try:
            results.append((enc, f(enc)))
        except Errors:
            # This codec cannot handle the input; move on to the next one.
            pass
    random.shuffle(results)
    for r in results:
        yield r

def fmt(encs, text):
    encode_decode = itertools.cycle(['encode', 'decode'])
    if not isinstance(text, six.text_type):
        next(encode_decode)
    chain = '.'.join( "{f}('{e}')".format(f = func, e = enc)
                     for enc, func in zip(encs, encode_decode) )
    return '{t!r}.{c}'.format(t = text, c = chain)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('start', type = ast.literal_eval, help = 'starting unicode')
    parser.add_argument('stop', type = ast.literal_eval, help = 'ending unicode')
    parser.add_argument('--all', '-a', action = 'store_true',
                        help = 'print every shortest chain, not just the first one found')
    args = parser.parse_args()
    min_len = None
    for encs, text in breadth_first_search(args.start, args.all):
        if min_len is not None and len(encs) > min_len:
            break
        if type(text) == type(args.stop) and text == args.stop:
            print(fmt(encs, args.start))
            min_len = len(encs)

if __name__ == '__main__':
    main()
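The script depends on utils_string, which is not a standard module, so us.all_encodings() is not available out of the box. One possible stand-in, assuming all it needs to do is return the names of the codecs Python can look up, is sketched below using only the standard library. The body is a guess at the helper's behaviour, not the original implementation.

```python
import codecs
import encodings
import pkgutil
from encodings.aliases import aliases


def all_encodings():
    """Return a sorted list of codec names that codecs.lookup() accepts.

    Assumed replacement for the us.all_encodings() helper.
    """
    # Canonical names known to the alias table ...
    names = set(aliases.values())
    # ... plus every module shipped in the encodings package.
    names.update(name for _, name, _ in pkgutil.iter_modules(encodings.__path__))

    usable = []
    for name in sorted(names):
        try:
            codecs.lookup(name)
        except (LookupError, ImportError):
            continue  # not a codec, or not available on this platform
        usable.append(name)
    return usable
```

With this in place, the line `encodings = us.all_encodings()` could become `encodings = all_encodings()`, although the module-level name would then shadow the imported encodings package, so renaming one of the two is advisable.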

Running

% guess_chain_encodings.py "u'\x99\x8c\x85\x8d'" "u'\u05e9\u05dc\u05d5\u05dd'" --all

yields

u'\x99\x8c\x85\x8d'.encode('latin_1').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('charmap').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('rot_13').decode('cp856')

etc.
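For the common case of a single encode followed by a single decode, a plain double loop is enough. The following Python 3 sketch is a depth-two brute force rather than the breadth-first search used by the script; the function name two_step_chains and the short codec list are assumptions made for the example.

```python
def two_step_chains(start, stop, codec_names):
    """Yield (enc, dec) pairs where start.encode(enc).decode(dec) == stop."""
    for enc in codec_names:
        try:
            raw = start.encode(enc)
        except (ValueError, LookupError, TypeError):
            continue  # this codec cannot encode the text
        for dec in codec_names:
            try:
                if raw.decode(dec) == stop:
                    yield enc, dec
            except (ValueError, LookupError, TypeError):
                continue  # these bytes are not valid in this codec


names = ['latin_1', 'cp1255', 'cp862', 'utf_8']
for enc, dec in two_step_chains('\x99\x8c\x85\x8d', '\u05e9\u05dc\u05d5\u05dd', names):
    print("encode({!r}).decode({!r})".format(enc, dec))
# encode('latin_1').decode('cp862')
```

Passing a longer list of codec names widens the search at quadratic cost.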

answered Nov 15 '22 08:11

unutbu


Tags:

python

unicode

iTayb
