Bytes in a unicode Python string

Tags: python, character-encoding, unicode, utf-8

In Python 2, Unicode strings may contain both unicode and bytes:

a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I understand that this is absolutely not something one should write in his own code, but this is a string that I have to deal with.

The bytes in the string above are UTF-8 for ек (Unicode \u0435\u043a).

My objective is to get a unicode string containing everything in Unicode, which is to say Русский ек (\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a).

Encoding it to UTF-8 yields

>>> a.encode('utf-8')
'\xd0\xa0\xd1\x83\xd1\x81\xd1\x81\xd0\xba\xd0\xb8\xd0\xb9 \xc3\x90\xc2\xb5\xc3\x90\xc2\xba'

Decoding that from UTF-8 just gives back the initial string with the bytes still in it, which is not good:

>>> a.encode('utf-8').decode('utf-8')
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'

I found a hacky way to solve the problem, however:

>>> repr(a)
"u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \\xd0\\xb5\\xd0\\xba'"
>>> eval(repr(a)[1:])
'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \xd0\xb5\xd0\xba'
>>> s = eval(repr(a)[1:]).decode('utf8')
>>> s
u'\\u0420\\u0443\\u0441\\u0441\\u043a\\u0438\\u0439 \u0435\u043a'
# Almost there, the bytes are proper now but the former real-unicode characters
# are now escaped with \u's; need to un-escape them.
>>> import re
>>> re.sub(u'\\\\u([a-f\\d]+)', lambda x : unichr(int(x.group(1), 16)), s)
u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a' # Success!

This works fine but looks very hacky due to its use of eval, repr, and then additional regex'ing of the unicode string representation. Is there a cleaner way?

asked Mar 23 '12 20:03

Etienne Perot

2 Answers

> In Python 2, Unicode strings may contain both unicode and bytes:

No, they may not. They contain Unicode characters.

Within the original string, \xd0 is not a byte that's part of a UTF-8 encoding. It is the Unicode character with code point 208. u'\xd0' == u'\u00d0'. It just happens that the repr for Unicode strings in Python 2 prefers to represent characters with \x escapes where possible (i.e. code points < 256).
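A quick way to check this, as a small sketch written for Python 3 (where `str` is what Python 2 calls `unicode`, and the `u` prefix is accepted but optional):

```python
import unicodedata

ch = u'\xd0'

# \xd0 and \u00d0 are two spellings of the same single character.
assert ch == u'\u00d0'
assert len(ch) == 1
assert ord(ch) == 208

# It is a letter in its own right, not a fragment of an encoded sequence.
print(unicodedata.name(ch))  # LATIN CAPITAL LETTER ETH
```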

There is no way to look at the string and tell whether a \xd0 character is supposed to be part of some UTF-8 encoded sequence or actually stands for that Unicode character by itself.
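To illustrate the ambiguity, here is a small Python 3 sketch. The variable names are made up for the example; the point is only that text someone typed on purpose and text produced by decoding with the wrong codec end up as the very same string:

```python
# Someone genuinely typed the two characters U+00C3 U+00A9.
genuine = u'\xc3\xa9'

# U+00E9 encoded as UTF-8, then decoded with the wrong codec.
mojibake = u'\xe9'.encode('utf-8').decode('latin-1')

# Once both are unicode strings, nothing distinguishes them.
assert genuine == mojibake

# "Repairing" such a string turns two characters into one, whether or not
# that was what the author of the text intended.
repaired = mojibake.encode('latin-1').decode('utf-8')
assert repaired == u'\xe9'
assert len(genuine) == 2 and len(repaired) == 1
```

So any automatic fix is a guess about intent, not something that can be read off the data.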

However, if you assume that you can always interpret those values as encoded ones, you could try writing something that analyzes each character in turn (use ord to convert to a code-point integer), decodes characters < 256 as UTF-8, and passes characters >= 256 as they were.
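A minimal sketch of that idea, written for Python 3 and assuming every run of characters below 256 really is UTF-8 data. The helper name `fix_embedded_utf8` is invented for this example. It groups consecutive characters by whether their code point is below 256, turns the low runs back into bytes via Latin-1 (which maps code points 0-255 to the same byte values) and decodes those bytes as UTF-8:

```python
from itertools import groupby


def fix_embedded_utf8(text):
    """Decode runs of characters below U+0100 as UTF-8; keep the rest as is."""
    parts = []
    for is_low, run in groupby(text, key=lambda ch: ord(ch) < 256):
        chunk = u''.join(run)
        if is_low:
            # Latin-1 recovers the raw byte values the characters stand for.
            # Raises UnicodeDecodeError if the run is not valid UTF-8.
            chunk = chunk.encode('latin-1').decode('utf-8')
        parts.append(chunk)
    return u''.join(parts)


a = u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \xd0\xb5\xd0\xba'
fixed = fix_embedded_utf8(a)
assert fixed == u'\u0420\u0443\u0441\u0441\u043a\u0438\u0439 \u0435\u043a'
```

ASCII characters pass through unchanged because they decode to themselves under UTF-8. If the assumption does not hold, for example with a genuine Latin-1 letter such as \xe4 in the input, the decode step raises instead of guessing.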

answered Oct 02 '22 19:10

Karl Knechtel

(In response to the comments above): this code converts everything that looks like utf8 and leaves other codepoints as is:

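a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'

def convert(s):
    # Treat the matched run of code points 0x80-0xFF as raw bytes and try UTF-8.
    try:
        return s.group(0).encode('latin1').decode('utf8')
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the run as it was.
        return s.group(0)

import re
a = re.sub(r'[\x80-\xFF]+', convert, a)
print a.encode('utf8')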

Result:

Рус utf:ек bytes:blää
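For anyone on Python 3, the same approach might look roughly like this. It is a sketch of a port, not something that was tested against the original data; only `print` and the string types differ from the Python 2 version:

```python
import re


def convert(match):
    run = match.group(0)
    try:
        # Reinterpret the run of code points 0x80-0xFF as bytes, then as UTF-8.
        return run.encode('latin-1').decode('utf-8')
    except UnicodeDecodeError:
        # Not valid UTF-8: leave the run untouched.
        return run


a = u'\u0420\u0443\u0441 utf:\xd0\xb5\xd0\xba bytes:bl\xe4\xe4'
result = re.sub(u'[\x80-\xff]+', convert, a)
print(result)  # Рус utf:ек bytes:blää
```

Each run is decoded as a whole, so a run that mixes valid UTF-8 with stray Latin-1 characters is left completely unchanged.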

answered Oct 02 '22 17:10

georg
