I'd like to convert some characters into five-digit Unicode code points in Python 3.3. For example,
import re
print(re.sub('a', u'\u1D15D', 'abc' ))
but the result is different from what I expected. Do I have to put the character itself, not the code point? Is there a better way to handle five-digit Unicode characters?
In Python, the built-in functions chr() and ord() convert between Unicode code points and characters. A character can also be written in a string literal as a hexadecimal Unicode code point using \x, \u, or \U.
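As a minimal sketch of the chr() and ord() round trip described above (the variable names are illustrative, and U+1D15D is simply the code point from the question):

```python
# Convert a code point to a character and back again.
note = chr(0x1D15D)          # build the character from its code point
code_point = ord(note)       # recover the code point from the character

print(hex(code_point))       # 0x1d15d
print(len(note))             # 1: a single character in Python 3.3+
print(note == '\U0001D15D')  # True: same character as the 8-digit escape
```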
In Python 2, if encoding and/or errors are given, unicode() will decode the object, which can be either an 8-bit string or a character buffer, using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised.
In Microsoft Word, to insert a Unicode character, type the character code, press ALT, and then press X. For example, to type a dollar symbol ($), type 0024, press ALT, and then press X. For more Unicode character codes, see Unicode character code charts by script.
In Python 3, the default string type (str) is a Unicode string; you can think of it as a sequence of human-readable characters. You can encode it to a byte string (bytes, written with a b prefix), and the byte string can be decoded back to a Unicode string.
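A small sketch of that encode/decode round trip, assuming UTF-8 as the codec (the snippet above does not name one) and using illustrative variable names:

```python
# Encode a str to bytes and decode it back.
text = '\U0001D15Dbc'                  # one astral character plus two ASCII letters
utf8_bytes = text.encode('utf-8')      # str -> bytes
round_trip = utf8_bytes.decode('utf-8')  # bytes -> str

print(utf8_bytes)          # b'\xf0\x9d\x85\x9dbc'
print(len(text))           # 3 characters
print(len(utf8_bytes))     # 6 bytes: U+1D15D takes four bytes in UTF-8
print(round_trip == text)  # True
```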
Python unicode escapes are either 4 hex digits (\uabcd) or 8 (\Uabcdabcd); for a codepoint beyond U+FFFF you need to use the latter (a capital U), making sure to left-fill with enough zeros:
>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'
(And yes, the U+1D15D codepoint (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it, showing a placeholder glyph (a box or question mark) instead.)
Because you used a \uabcd escape, you replaced a in abc with two characters: the codepoint U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) and the ASCII character D. Using the 8-digit \U escape works:
>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc' ))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc' ).encode('unicode_escape'))
b'\\U0001d15dbc'
where again the U+1D15D codepoint could be displayed by your font as a placeholder glyph instead.
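If the font issue makes it hard to see what happened, one way to check without relying on glyphs is to inspect lengths and character names. This is only a sketch; the names wrong and right are illustrative:

```python
import unicodedata

wrong = '\u1D15D'      # parsed as the escape \u1D15 followed by a literal 'D'
right = '\U0001D15D'   # a single code point beyond U+FFFF

print(len(wrong), len(right))                 # 2 1
print([unicodedata.name(c) for c in wrong])   # names of both characters in the 4-digit form
print(unicodedata.name(right))                # MUSICAL SYMBOL WHOLE NOTE
```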
By the way, you do not need the re module for this. You could use str.translate:
>>> 'abc'.translate({ord('a'):'\U0001D15D'})
'𝅝bc'
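For a single fixed substitution like this one, str.replace is another option that avoids re, and str.maketrans can build the translation table from character keys. This is a sketch of those alternatives, not something the answer above prescribes; the variable names are illustrative:

```python
source = 'abc'

# Plain substring replacement; chr() builds the character from its code point.
replaced = source.replace('a', chr(0x1D15D))

# str.maketrans accepts a dict keyed by single characters and
# converts the keys to ordinals for str.translate.
table = str.maketrans({'a': '\U0001D15D'})
translated = source.translate(table)

print(replaced == translated)                # True
print(replaced.encode('unicode_escape'))     # b'\\U0001d15dbc'
```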