A faulty unicode string is one that has accidentally encoded bytes in it. For example:
Text: שלום
Windows-1255-encoded: \x99\x8c\x85\x8d
Unicode: u'\u05e9\u05dc\u05d5\u05dd'
Faulty Unicode: u'\x99\x8c\x85\x8d'
I sometimes bump into such strings when parsing ID3 tags in MP3 files. How can I fix these strings (e.g. convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd')?
You could convert u'\x99\x8c\x85\x8d' to '\x99\x8c\x85\x8d' using the latin-1 encoding:
In [9]: x = u'\x99\x8c\x85\x8d'
In [10]: x.encode('latin-1')
Out[10]: '\x99\x8c\x85\x8d'
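The reason latin-1 works here is that it maps each code point in the range 0 to 255 to the byte with the same value, so the encode step recovers the original bytes unchanged. The following is a minimal Python 3 sketch of that property (the session above is Python 2, so the literals differ slightly); the variable names are only illustrative.

```python
# Python 3 sketch: latin-1 is a 1:1 mapping between code points 0..255
# and byte values 0..255, so no information is lost in either direction.
faulty = '\x99\x8c\x85\x8d'          # text that really holds raw byte values
raw = faulty.encode('latin-1')       # recover those bytes unchanged

all_bytes = bytes(range(256))
round_trip = all_bytes.decode('latin-1').encode('latin-1')

print(raw)
print(round_trip == all_bytes)
```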
However, '\x99\x8c\x85\x8d' does not appear to be a valid Windows-1255 encoding of this text. Did you perhaps mean '\xf9\xec\xe5\xed'? If so, then
In [22]: x = u'\xf9\xec\xe5\xed'
In [23]: x.encode('latin-1').decode('cp1255')
Out[23]: u'\u05e9\u05dc\u05d5\u05dd'
converts u'\xf9\xec\xe5\xed' to u'\u05e9\u05dc\u05d5\u05dd', which matches the desired unicode you posted.
If you really want to convert u'\x99\x8c\x85\x8d' into u'\u05e9\u05dc\u05d5\u05dd', then this happens to work:
In [27]: u'\x99\x8c\x85\x8d'.encode('latin-1').decode('cp862')
Out[27]: u'\u05e9\u05dc\u05d5\u05dd'
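For Python 3, the same repair could be wrapped in a small helper. This is only a sketch: the function name `repair` is hypothetical, and the caller still has to know (or guess) which encoding the bytes were really in. It also shows how such a faulty string can arise in the first place, namely bytes in one encoding being decoded as latin-1.

```python
def repair(text, actual_encoding):
    """Hypothetical helper: undo a wrong latin-1 decode.

    Returns the text unchanged if it cannot be a latin-1 mis-decode
    or if the bytes are not valid in `actual_encoding`.
    """
    # Characters above U+00FF cannot come from a latin-1 decode.
    if any(ord(ch) > 0xFF for ch in text):
        return text
    try:
        return text.encode('latin-1').decode(actual_encoding)
    except UnicodeDecodeError:
        return text


good = '\u05e9\u05dc\u05d5\u05dd'

# How the faulty string arises: cp862 bytes read as if they were latin-1.
faulty = good.encode('cp862').decode('latin-1')

print(repair(faulty, 'cp862') == good)
print(repair('\xf9\xec\xe5\xed', 'cp1255') == good)
```

Text that already contains real Hebrew characters passes through untouched, so the helper can be applied to a mix of good and faulty tags.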
The above encoding/decoding chain was found using this script:
guess_chain_encodings.py
"""
Usage example: guess_chain_encodings.py "u'баба'" "u'\xe1\xe0\xe1\xe0'"
"""
import six
import argparse
import binascii
import zlib
# utils_string is a personal helper module (not on PyPI);
# all_encodings() is expected to return a list of codec names.
import utils_string as us
import ast
import collections
import itertools
import random

encodings = us.all_encodings()
Errors = (IOError, UnicodeEncodeError, UnicodeError, LookupError,
          TypeError, ValueError, binascii.Error, zlib.error)

def breadth_first_search(text, all = False):
    # Explore encode/decode chains in order of increasing length.
    seen = set()
    tasks = collections.deque()
    tasks.append(([], text))
    while tasks:
        encs, text = tasks.popleft()
        for enc, newtext in candidates(text):
            if repr(newtext) not in seen:
                if not all:
                    seen.add(repr(newtext))
                newtask = encs+[enc], newtext
                tasks.append(newtask)
                yield newtask

def candidates(text):
    # Unicode text gets encoded, byte strings get decoded.
    f = text.encode if isinstance(text, six.text_type) else text.decode
    results = []
    for enc in encodings:
        try:
            results.append((enc, f(enc)))
        except Errors:
            pass
    random.shuffle(results)
    for r in results:
        yield r

def fmt(encs, text):
    # Render the chain as a Python expression, alternating encode/decode.
    encode_decode = itertools.cycle(['encode', 'decode'])
    if not isinstance(text, six.text_type):
        next(encode_decode)
    chain = '.'.join( "{f}('{e}')".format(f = func, e = enc)
                     for enc, func in zip(encs, encode_decode) )
    return '{t!r}.{c}'.format(t = text, c = chain)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('start', type = ast.literal_eval, help = 'starting unicode')
    parser.add_argument('stop', type = ast.literal_eval, help = 'ending unicode')
    parser.add_argument('--all', '-a', action = 'store_true')
    args = parser.parse_args()
    min_len = None
    for encs, text in breadth_first_search(args.start, args.all):
        # Stop once chains get longer than the shortest match found.
        if min_len is not None and len(encs) > min_len:
            break
        if type(text) == type(args.stop) and text == args.stop:
            print(fmt(encs, args.start))
            min_len = len(encs)

if __name__ == '__main__':
    main()
Running
% guess_chain_encodings.py "u'\x99\x8c\x85\x8d'" "u'\u05e9\u05dc\u05d5\u05dd'" --all
yields
u'\x99\x8c\x85\x8d'.encode('latin_1').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('charmap').decode('cp862')
u'\x99\x8c\x85\x8d'.encode('rot_13').decode('cp856')
etc.
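If you only need the common two-step case (one encode followed by one decode), a much smaller Python 3 search may be enough. The sketch below is not the script above: it is a simplified brute-force variant, the function name `find_two_step_chains` is hypothetical, and the short `CANDIDATES` list is an assumption chosen for this example rather than a complete list of codecs.

```python
import itertools

# Assumed candidate codecs for this example; extend as needed.
CANDIDATES = ['latin-1', 'cp1252', 'cp1255', 'cp862', 'utf-8']


def find_two_step_chains(start, stop, codecs=CANDIDATES):
    """Return every (encode_codec, decode_codec) pair that turns start into stop."""
    chains = []
    for enc, dec in itertools.product(codecs, repeat=2):
        try:
            result = start.encode(enc).decode(dec)
        except (UnicodeError, LookupError):
            # This pair cannot process the text; try the next one.
            continue
        if result == stop:
            chains.append((enc, dec))
    return chains


chains = find_two_step_chains('\x99\x8c\x85\x8d', '\u05e9\u05dc\u05d5\u05dd')
for enc, dec in chains:
    print("text.encode({!r}).decode({!r})".format(enc, dec))
```

Unlike the breadth-first script, this only considers chains of length two, which keeps it deterministic and easy to reason about at the cost of missing longer chains.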