How to convert some character into five digit unicode one in Python 3.3?

Tags:

python

regex

unicode

python-3.3

I'd like to convert some character into a five-digit Unicode one in Python 3.3. For example,

import re
print(re.sub('a', u'\u1D15D', 'abc' ))

but the result is different from what I expected. Do I have to put in the character itself rather than the codepoint? Is there a better way to handle five-digit Unicode characters?

385

asked Jan 31 '13 11:01

user1610952

2 Answers

Python Unicode escapes are either 4 hex digits (\uabcd) or 8 (\Uabcdabcd). For a codepoint beyond U+FFFF you need the latter (a capital U), left-padded with zeros to a full 8 digits:

>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'

(And yes, the U+1D15D codepoint (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it, showing a placeholder glyph (a box or question mark) instead.)

Because you used a \uabcd escape, which consumes exactly 4 hex digits, you replaced a in abc with two characters: the codepoint U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) and the ASCII character D. Using the 8-digit \U escape works:

>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc' ))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc' ).encode('unicode_escape'))
b'\\U0001d15dbc'

where again the U+1D15D codepoint could be displayed by your font as a placeholder glyph instead.
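
To see the difference without depending on font support, one can inspect lengths and codepoints rather than printing the glyph. The following is a minimal sketch; the variable names are illustrative and not part of the original question, and the `len` result assumes Python 3.3 or later:

```python
import unicodedata

short_escape = '\u1D15D'    # parsed as '\u1D15' followed by a literal 'D'
long_escape = '\U0001D15D'  # one codepoint, U+1D15D

print(len(short_escape))                    # 2
print([hex(ord(c)) for c in short_escape])  # ['0x1d15', '0x44']
print(len(long_escape))                     # 1
print(hex(ord(long_escape)))                # 0x1d15d

# Other ways to get the same character without typing the 8-digit escape
from_int = chr(0x1D15D)
from_name = '\N{MUSICAL SYMBOL WHOLE NOTE}'
print(long_escape == from_int == from_name)  # True
print(unicodedata.name(long_escape))         # MUSICAL SYMBOL WHOLE NOTE

# Zero-padding a codepoint to the 8 digits that \U requires
escape_text = '\\U{:08X}'.format(0x1D15D)
print(escape_text)                           # \U0001D15D
print(escape_text.encode('ascii').decode('unicode_escape') == long_escape)  # True
```

If the codepoint is only known at run time, `chr()` is probably the simplest option, since it avoids building and decoding escape text at all.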

118

answered Nov 15 '22 22:11

Martijn Pieters

By the way, you do not need the re module for this. You could use str.translate:

>>> 'abc'.translate({ord('a'):'\U0001D15D'})
'𝅝bc'
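
As a further sketch along the same lines: for a single fixed substring, `str.replace` should be enough, and `str.maketrans` can build the translation table when several characters need mapping. The `WHOLE_NOTE` name and the extra mapping for `'c'` are arbitrary illustrations, not something from the question:

```python
WHOLE_NOTE = '\U0001D15D'  # MUSICAL SYMBOL WHOLE NOTE

# One fixed substring: str.replace needs no regex and no table
replaced = 'abc'.replace('a', WHOLE_NOTE)

# Several single-character mappings at once: build the table with str.maketrans
# (the 'c' -> U+1D15E mapping is only here to show a second entry)
table = str.maketrans({'a': WHOLE_NOTE, 'c': '\U0001D15E'})
translated = 'abc'.translate(table)

# Show escapes instead of glyphs so the output does not depend on the font
print(replaced.encode('unicode_escape'))    # b'\\U0001d15dbc'
print(translated.encode('unicode_escape'))  # b'\\U0001d15db\\U0001d15e'
```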

answered Nov 15 '22 22:11

unutbu
