
detect/remove unpaired surrogate character in Python 2 + GTK

In Python 2.7 I can successfully convert the Unicode string "abc\udc34xyz" to UTF-8 (the result is "abc\xed\xb0\xb4xyz"). But when I pass that UTF-8 string to, e.g., pango_parse_markup() or g_convert_with_fallback(), I get errors like "Invalid byte sequence in conversion input". Apparently the GTK/Pango functions detect the unpaired surrogate in the string and (correctly?) reject it.

Python 3 doesn't even allow conversion of the Unicode string to UTF-8 (error: "'utf-8' codec can't encode character '\udc34' in position 3: surrogates not allowed"), but I can run "abc\udc34xyz".encode("utf8", "replace") to get a valid UTF-8 string with the lone surrogate replaced by some other character. That's fine for me, but I need a solution for Python 2.
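
For reference, the Python 3 behaviour described above can be reproduced with a short sketch. It assumes Python 3 only, and the variable names are purely illustrative. Note that when encoding, the "replace" handler substitutes an ASCII question mark rather than U+FFFD, while "surrogatepass" yields the same bytes that Python 2.7 produces:

```python
# Python 3 only: how the UTF-8 encoder treats a lone surrogate.
s = "abc\udc34xyz"

# Strict encoding (the default) refuses surrogates.
try:
    s.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "replace" swaps the unencodable character for "?".
replaced = s.encode("utf-8", "replace")

# "surrogatepass" emits the surrogate as three bytes, which is not valid UTF-8.
passed = s.encode("utf-8", "surrogatepass")

print(strict_failed)
print(replaced)
print(passed)
```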

So the question is: in Python 2.7, how can I convert that Unicode string to UTF-8 while replacing the lone surrogate with some replacement character like U+FFFD? Preferably only standard Python functions and GTK/GLib/G... functions should be used.

By the way, iconv can convert the string to UTF-8, but it simply removes the bad character instead of replacing it with U+FFFD.

Asked by oliver on Sep 07 '13 at 12:09



1 Answer

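You can replace the lone surrogates yourself before encoding. The regular expression below matches a leading surrogate that is not followed by a trailing surrogate, or a trailing surrogate that is not preceded by a leading one: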

import re

lone = re.compile(
    ur'''(?x)            # verbose expression (allows comments)
    (                    # begin group
    [\ud800-\udbff]      #   match leading surrogate
    (?![\udc00-\udfff])  #   but only if not followed by trailing surrogate
    )                    # end group
    |                    #  OR
    (                    # begin group
    (?<![\ud800-\udbff]) #   if not preceded by leading surrogate
    [\udc00-\udfff]      #   match trailing surrogate
    )                    # end group
    ''')

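# \ud834 is a lone leading surrogate; \ud82a\udfcd is a valid pair (U+1ABCD)
u = u'abc\ud834\ud82a\udfcdxyz'
print repr(u)
# Replace only the lone surrogate with U+FFFD, then encode to UTF-8
b = lone.sub(ur'\ufffd', u).encode('utf8')
print repr(b)
print repr(b.decode('utf8'))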

Output:

u'abc\ud834\U0001abcdxyz'
'abc\xef\xbf\xbd\xf0\x9a\xaf\x8dxyz'
u'abc\ufffd\U0001abcdxyz'
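
The code above is Python 2 only. On Python 3 a str holds code points rather than UTF-16 code units, and the strict UTF-8 encoder rejects every surrogate code point, paired or not. A minimal sketch for that case, assuming Python 3 and assuming it is acceptable to turn every surrogate code point into U+FFFD (the helper name `replace_surrogates` is hypothetical, not a standard function):

```python
import re

# Python 3 only. Matches any surrogate code point (U+D800..U+DFFF).
_SURROGATE = re.compile("[\ud800-\udfff]")


def replace_surrogates(text):
    """Return text with every surrogate code point replaced by U+FFFD."""
    return _SURROGATE.sub("\ufffd", text)


# The cleaned string can now be encoded with the strict UTF-8 codec.
data = replace_surrogates("abc\udc34xyz").encode("utf-8")
print(data)
```

Unlike encode("utf8", "replace"), this keeps the replacement visible as U+FFFD instead of a question mark.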
Answered by Mark Tolonen on Sep 27 '22 at 23:09
