
Which is the correct way to encode escape characters in Python 2 without killing Unicode?

I think I'm going crazy with Python's Unicode strings. I'm trying to encode escape characters in a Unicode string without escaping actual Unicode characters. This is what I get:

In [14]: a = u"Example\n"

In [15]: b = u"Пример\n"

In [16]: print a
Example


In [17]: print b
Пример


In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
\u041f\u0440\u0438\u043c\u0435\u0440\n

whereas what I actually need is this (the English example already works as I want, obviously):

In [18]: print a.encode('unicode_escape')
Example\n

In [19]: print b.encode('unicode_escape')
Пример\n

What should I do, short of moving to Python 3?

PS: As pointed out below, what I'm actually trying to do is escape control characters. Whether I need more than just those remains to be seen.

Nikolai Prokoschenko asked Mar 19 '12 21:03


2 Answers

Backslash-escaping ASCII control characters in the middle of Unicode data is definitely a useful thing to do. But escaping is only half of it: you also need to unescape them properly when you want the original character data back.

There should be a way to do this in the Python stdlib, but there isn't one. I filed a bug report: http://bugs.python.org/issue18679

In the meantime, here's a workaround using translate and some hackery:

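# As written this runs on Python 3. On Python 2, unicode.translate() needs
# unicode values in the table; the update further down handles that.
tm = dict((k, repr(chr(k))[1:-1]) for k in range(32))  # \t, \n, \r, otherwise \x##
tm[0] = r'\0'
tm[7] = r'\a'
tm[8] = r'\b'
tm[11] = r'\v'
tm[12] = r'\f'
tm[ord('\\')] = '\\\\'  # double literal backslashes so unescaping is unambiguous

b = u"Пример\n"
c = b.translate(tm)
print(c)  # prints: Пример\n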

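Control characters that have no single-letter backslash escape are written as \x## sequences; if you need something different for those, change the translation table accordingly. Because literal backslashes are doubled, the escaping can be reversed, which is what I needed. One caveat: the short \0 form is ambiguous when the next character is a digit from 0 to 7, because \01 reads as an octal escape; mapping 0 to \x00 instead avoids that.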
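To make the table concrete, here is a minimal sketch that builds the generated part of it and prints a few entries. It assumes Python 3 (where chr() returns text), and the name `sample` is only illustrative.

```python
# Assumes Python 3: chr() returns text, so the table values are str.
# repr(chr(k)) looks like "'\\t'" or "'\\x1b'"; [1:-1] strips the quotes.
tm = {k: repr(chr(k))[1:-1] for k in range(32)}
tm[ord('\\')] = '\\\\'  # a literal backslash becomes two backslashes

print(tm[9])    # \t
print(tm[10])   # \n
print(tm[27])   # \x1b  (no single-letter escape exists for ESC)

sample = 'tab\there, backslash\\ there, esc\x1b'
print(sample.translate(tm))  # tab\there, backslash\\ there, esc\x1b
```

Only code points below 32 and the backslash are in the table, so everything else, including Cyrillic text, passes through translate unchanged.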

Getting the original back is hacky too, because translate only maps single characters; it cannot turn multi-character sequences back into single characters.

d = c.encode('latin1', 'backslashreplace').decode('unicode_escape')
print(d)  # prints Пример followed by an actual newline character

You have to encode with latin1 so that every character in the Latin-1 range becomes a single byte, while backslashreplace turns the characters latin1 doesn't know about into backslash escapes. The unicode_escape codec can then reassemble everything the right way.
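The two stages are easier to follow with the intermediate value visible. This is a small sketch, assuming Python 3.3 or later; the variable names are only illustrative.

```python
# Escaped text: Cyrillic letters followed by a literal backslash and 'n'.
escaped = 'Пример\\n'

# Stage 1: Latin-1 characters become single bytes; anything else is
# rewritten as an ASCII \uXXXX escape by the backslashreplace handler.
as_bytes = escaped.encode('latin1', 'backslashreplace')
print(as_bytes)  # b'\\u041f\\u0440\\u0438\\u043c\\u0435\\u0440\\n'

# Stage 2: unicode_escape interprets every backslash escape in one pass,
# both the \uXXXX ones from stage 1 and the original \n.
restored = as_bytes.decode('unicode_escape')
print(repr(restored))  # 'Пример\n'
```

The intermediate bytes are pure ASCII here, which is why a single unicode_escape pass is enough to restore both the Cyrillic letters and the newline.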

UPDATE:

I later had a case where this needed to work in both Python 2.7 and Python 3.3. Here's what I did (buried in a _compat.py module):

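# Naming: u = unicode text, b = bytes, n = the native str type.
# Note: 'surrogateescape' is a Python 3 error handler; on Python 2 it is only
# looked up when an error actually occurs, and then raises LookupError.
if isinstance(b"", str):  # Python 2: bytes is str
    byte_types = (str, bytes, bytearray)
    text_types = (unicode, )
    def uton(x): return x.encode('utf-8', 'surrogateescape')
    def ntob(x): return x
    def ntou(x): return x.decode('utf-8', 'surrogateescape')
    def bton(x): return x
else:  # Python 3: str is text
    byte_types = (bytes, bytearray)
    text_types = (str, )
    def uton(x): return x
    def ntob(x): return x.encode('utf-8', 'surrogateescape')
    def ntou(x): return x
    def bton(x): return x.decode('utf-8', 'surrogateescape')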

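escape_tm = dict((k, ntou(repr(chr(k))[1:-1])) for k in range(32))
# The values must be the two-character escape sequences (backslash + letter),
# not the control characters themselves, hence the doubled backslashes.
escape_tm[0] = u'\\0'
escape_tm[7] = u'\\a'
escape_tm[8] = u'\\b'
escape_tm[11] = u'\\v'
escape_tm[12] = u'\\f'
escape_tm[ord('\\')] = u'\\\\'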

def escape_control(s):
    if isinstance(s, text_types):
        return s.translate(escape_tm)
    else:
        return s.decode('utf-8', 'surrogateescape').translate(escape_tm).encode('utf-8', 'surrogateescape')

def unescape_control(s):
    if isinstance(s, text_types):
        return s.encode('latin1', 'backslashreplace').decode('unicode_escape')
    else:
        return s.decode('utf-8', 'surrogateescape').encode('latin1', 'backslashreplace').decode('unicode_escape').encode('utf-8', 'surrogateescape')
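
A quick way to gain confidence in helpers like these is a round-trip check. The sketch below is a simplified, text-only, Python 3 restatement written for illustration (it is not the module above); `samples` and `short_nul_tm` are hypothetical names. It also shows why a short \0 escape for NUL is risky when a digit follows.

```python
# Simplified, Python 3 only, text-only versions of the helpers (illustration).
escape_tm = {k: repr(chr(k))[1:-1] for k in range(32)}
escape_tm[ord('\\')] = '\\\\'  # double backslashes so unescaping is unambiguous

def escape_control(s):
    return s.translate(escape_tm)

def unescape_control(s):
    return s.encode('latin1', 'backslashreplace').decode('unicode_escape')

# Each sample should survive escape followed by unescape unchanged.
samples = ['Пример\n', 'tab\tand\\backslash', 'bell\x07', 'nul\x001']
for original in samples:
    assert unescape_control(escape_control(original)) == original

# With the short form for NUL, NUL followed by '1' is escaped as \01,
# which unicode_escape reads back as the octal escape for U+0001.
short_nul_tm = dict(escape_tm)
short_nul_tm[0] = '\\0'
broken = unescape_control('nul\x001'.translate(short_nul_tm))
print(repr(broken))  # 'nul\x01'
```

If NUL characters followed by digits can occur in your data, keeping the generated \x00 form for code point 0 avoids the ambiguity.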

underrun answered Sep 28 '22 07:09


First let's correct the terminology. What you're trying to do is replace "control characters" with an equivalent "escape sequence".

I haven't been able to find any built-in method to do this, and nobody has yet posted one. Fortunately it's not a hard function to write.

# Python 2: unichr() returns a one-character unicode string
control_chars = [unichr(c) for c in range(0x20)] # you may extend this as required

def control_escape(s):
    chars = []
    for c in s:
        if c in control_chars:
            # e.g. u'\n' becomes the two characters backslash + n
            chars.append(c.encode('unicode_escape'))
        else:
            chars.append(c)
    return u''.join(chars)

Or the slightly less readable one-liner version:

def control_escape2(s):
    return u''.join([c.encode('unicode_escape') if c in control_chars else c for c in s])
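
For readers on Python 3, where unichr() no longer exists and encode() returns bytes, the same idea could be sketched as follows. The name `control_escape_py3` is hypothetical, and this is one possible port rather than code from the answer.

```python
# Python 3 port (illustrative). A set gives constant-time membership tests.
control_chars = {chr(c) for c in range(0x20)}  # extend as required

def control_escape_py3(s):
    # encode() yields bytes on Python 3, so decode the ASCII escape back
    # to text before joining it with the untouched characters.
    return ''.join(
        c.encode('unicode_escape').decode('ascii') if c in control_chars else c
        for c in s
    )

print(control_escape_py3('Пример\n'))  # Пример\n
```

Note that this approach leaves literal backslashes alone, so the output is meant for display; it cannot be reliably unescaped if the input may itself contain backslashes.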

Mark Ransom answered Sep 28 '22 06:09