Bytes in a unicode Python string

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba' 

I understand that this is absolutely not something one should write in one's own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 simply gives back the initial string, with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
# Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

Etienne Perot, asked Mar 23 '12 20:03



2 Answers

"In Python 2, Unicode strings may contain both unicode and bytes:"

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
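As a small illustration of the point above (a sketch only; the variable names are made up for this example, and the code is written so that it runs on both Python 2.7 and Python 3.3+), the two escapes really are the same character, and encoding to Latin-1 is one way to turn such characters back into the byte values they presumably came from:

```python
# Runs on Python 2.7 and Python 3.3+ (both accept the u'' prefix).
s = u'\xd0\xb5'

# \xd0 and \u00d0 are two spellings of the same code point, U+00D0.
same_character = (u'\xd0' == u'\u00d0')

# ord() shows plain code points, not bytes.
code_points = [ord(ch) for ch in s]

# Latin-1 maps code points 0-255 to the identical byte values, so this
# yields the two bytes 0xD0 0xB5 ...
raw = s.encode('latin-1')

# ... which, under the ASSUMPTION that they were meant as UTF-8,
# decode to the single Cyrillic character U+0435.
fixed = raw.decode('utf-8')
```

Whether that last step is the right interpretation is exactly what the string itself cannot tell you.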

There is no way to look at the string and tell whether the \xd0 character is supposed to be part of some UTF-8 encoded character or actually stands for that Unicode character by itself.

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
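A rough sketch of that idea follows. The function name fix_mixed is hypothetical, and the code assumes that every character below 256 really is a UTF-8 byte; if that assumption is wrong, the decode step raises UnicodeDecodeError. It works on runs of characters rather than single characters, because one UTF-8 sequence spans several of them.

```python
from itertools import groupby


def fix_mixed(text):
    """Reinterpret runs of code points < 256 as UTF-8 bytes.

    Hypothetical helper: assumes every such run is valid UTF-8.
    """
    pieces = []
    # Split the string into alternating runs of "low" (< 256) and
    # "high" (>= 256) characters.
    for is_low, run in groupby(text, key=lambda ch: ord(ch) < 256):
        chunk = u''.join(run)
        if is_low:
            # Latin-1 turns each code point 0-255 into the same byte value;
            # the resulting bytes are then decoded as UTF-8.
            chunk = chunk.encode('latin-1').decode('utf-8')
        # Characters >= 256 are passed through unchanged.
        pieces.append(chunk)
    return u''.join(pieces)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
result = fix_mixed(a)
```

ASCII characters fall into the "low" runs too, which is harmless because ASCII is already valid UTF-8.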

Karl Knechtel, answered Oct 02 '22 19:10


(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:

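import re

a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(s):
    # Reinterpret the matched run of code points 0x80-0xFF as raw bytes,
    # then try to decode those bytes as UTF-8.
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the run unchanged.
        return s.group(0)

a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')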

Result:

Рус utf:ек bytes:blää   
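One caveat with matching whole runs: if a valid UTF-8 sequence sits directly next to a stray Latin-1 character (for example \xd0\xb5\xe4), the run as a whole fails to decode and is left untouched. A possible refinement, sketched here with hypothetical names (UTF8_SEQUENCE, decode_sequence, fix_embedded_utf8) and not part of the answer above, is to match only sequences shaped like a UTF-8 lead byte plus its continuation bytes. It runs on Python 2.7 and Python 3.3+.

```python
import re

# A lead byte followed by the right number of continuation bytes
# (2-, 3- and 4-byte UTF-8 forms), expressed as code points 0x80-0xFF.
UTF8_SEQUENCE = re.compile(
    u'[\xc2-\xdf][\x80-\xbf]'
    u'|[\xe0-\xef][\x80-\xbf]{2}'
    u'|[\xf0-\xf4][\x80-\xbf]{3}'
)


def decode_sequence(match):
    try:
        # Same trick as above: Latin-1 gives back the byte values.
        return match.group(0).encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        # Right shape but still invalid (e.g. an overlong encoding):
        # leave it unchanged.
        return match.group(0)


def fix_embedded_utf8(text):
    return UTF8_SEQUENCE.sub(decode_sequence, text)


mixed = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4 run:\xd0\xb5\xe4'
fixed = fix_embedded_utf8(mixed)
```

This is still a guess about intent: text that genuinely contains a pair such as Ð followed by µ would be converted as well.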
georg, answered Oct 02 '22 17:10