Dealing with a string containing multiple character encodings

Question

I'm not sure how best to phrase this question, and I'm nowhere close to finding an answer, so I hope someone can help me.

I'm writing a Python app that connects to a remote host and receives back byte data, which I unpack using Python's built-in struct module. My problem is with the strings, as they include multiple character encodings. Here is an example of such a string:

"^LThis is an example ^Gstring with multiple ^Jcharacter encodings"

Where the different encoding starts and ends is marked using special escape chars:

^L - Latin1
^E - Central Europe
^T - Turkish
^B - Baltic
^J - Japanese
^C - Cyrillic
^G - Greek

And so on... I need a way to convert this sort of string into Unicode, but I'm not sure how to do it. I've read up on Python's codecs and on string.encode/decode, but I'm none the wiser. I should also mention that I have no control over how the host outputs the strings.
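To make the marker scheme concrete, here is a small sketch in Python 3. It assumes the ^X markers are caret notation for ASCII control characters (the letter's code minus 64, so ^L is byte 0x0C); the question does not state this outright. The helper name caret_to_byte is hypothetical.

```python
# Assumption: "^L" is caret notation for a control byte, i.e. the ASCII code
# of the letter minus 64 (so ^L -> 0x0C, ^G -> 0x07, ^J -> 0x0A).
def caret_to_byte(letter):
    """Return the single control byte that caret notation ^<letter> denotes."""
    return bytes([ord(letter) - 64])

# Marker letters and labels exactly as listed in the question.
MARKER_LABELS = {
    caret_to_byte(letter): label
    for letter, label in [
        ('L', 'Latin1'), ('E', 'Central Europe'), ('T', 'Turkish'),
        ('B', 'Baltic'), ('J', 'Japanese'), ('C', 'Cyrillic'), ('G', 'Greek'),
    ]
}
```

Under that assumption, the Japanese marker ^J is the same byte as a line feed, so any code that splits the data on newlines first would break the segments apart.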

I hope someone can help me with how to get started on this.

zellyn · Accepted Answer

Here's a relatively simple example of how to do it...

# -*- coding: utf-8 -*-
import re

# Test Data
ENCODING_RAW_DATA = (
    ('latin_1',    'L', u'Hello'),        # Latin 1
    ('iso8859_2',  'E', u'dobrý večer'),  # Central Europe
    ('iso8859_9',  'T', u'İyi akşamlar'), # Turkish
    ('iso8859_13', 'B', u'Į sveikatą!'),  # Baltic
    ('shift_jis',  'J', u'今日は'),        # Japanese
    ('iso8859_5',  'C', u'Здравствуйте'), # Cyrillic
    ('iso8859_7',  'G', u'Γειά σου'),   # Greek
)

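# Map each marker character (e.g. 'L' -> '\x0c', i.e. ^L) to its codec name.
CODE_TO_ENCODING = dict([(chr(ord(code)-64), encoding) for encoding, code, text in ENCODING_RAW_DATA])
# The Unicode text we expect to get back after decoding.
EXPECTED_RESULT = u''.join([line[2] for line in ENCODING_RAW_DATA])
# Build the test input: each segment is its marker character followed by the encoded text.
ENCODED_DATA = ''.join([chr(ord(code)-64) + text.encode(encoding) for encoding, code, text in ENCODING_RAW_DATA])

# A segment is one marker (control) character followed by any run of non-marker characters.
FIND_RE = re.compile('[\x00-\x1A][^\x00-\x1A]*')

def decode_single(segment):
    # segment[0] is the marker; the rest is text in the encoding it selects.
    return segment[1:].decode(CODE_TO_ENCODING[segment[0]])

result = u''.join([decode_single(segment) for segment in FIND_RE.findall(ENCODED_DATA)])

assert result == EXPECTED_RESULT, u"Expected %s, but got %s" % (EXPECTED_RESULT, result)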
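
The code above is written for Python 2, where str is a byte string. For readers on Python 3, the same idea might look like the sketch below. The names CODEC_BY_MARKER, SEGMENT_RE and decode_mixed are hypothetical, the codec choices are copied from the test data above, and the marker byte values assume caret notation (letter code minus 64).

```python
import re

# Marker byte -> Python codec name. Codec choices follow the answer's test
# data; the byte values assume ^X means chr(ord('X') - 64).
CODEC_BY_MARKER = {
    0x0C: 'latin_1',     # ^L Latin1
    0x05: 'iso8859_2',   # ^E Central Europe
    0x14: 'iso8859_9',   # ^T Turkish
    0x02: 'iso8859_13',  # ^B Baltic
    0x0A: 'shift_jis',   # ^J Japanese
    0x03: 'iso8859_5',   # ^C Cyrillic
    0x07: 'iso8859_7',   # ^G Greek
}

# One marker byte followed by any run of bytes outside the marker range.
SEGMENT_RE = re.compile(rb'[\x00-\x1a][^\x00-\x1a]*')

def decode_mixed(data):
    """Decode bytes made of marker-prefixed segments into one str."""
    parts = []
    for segment in SEGMENT_RE.findall(data):
        # Indexing a bytes object in Python 3 yields an int.
        codec = CODEC_BY_MARKER[segment[0]]
        parts.append(segment[1:].decode(codec))
    return ''.join(parts)

sample = (b'\x0cHello '
          + b'\x07' + 'Γειά σου'.encode('iso8859_7')
          + b'\x0a' + '今日は'.encode('shift_jis'))
print(decode_mixed(sample))
```

Two limits of this sketch: bytes that appear before the first marker are silently skipped by findall, and a control byte with no entry in the table raises KeyError. Both would need a decision about the host's actual behaviour, which the question does not describe.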

Tags: python, string, encoding, unicode

Alex McBride
