
How to convert a character into a five-digit Unicode character in Python 3.3?

I'd like to convert some character into a five-digit Unicode one in Python 3.3. For example,

import re
print(re.sub('a', u'\u1D15D', 'abc' ))

but the result is different from what I expected. Do I have to put in the character itself rather than the code point? Is there a better way to handle five-digit Unicode characters?

user1610952 asked Jan 31 '13 11:01



2 Answers

Python Unicode escapes take either 4 hex digits (\uabcd) or 8 hex digits (\Uabcdabcd). For a code point beyond U+FFFF you need the latter (with a capital U), left-padded with zeros to the full 8 digits:

>>> '\U0001D15D'
'𝅝'
>>> '\U0001D15D'.encode('unicode_escape')
b'\\U0001d15d'

(And yes, the U+1D15D code point (MUSICAL SYMBOL WHOLE NOTE) is in the above example, but your browser font may not be able to render it and may show a placeholder glyph (a box or question mark) instead.)
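If counting zeros in the escape feels error-prone, the same character can also be built from its code point or its Unicode name. This is only a small sketch using the built-in chr() and the \N{...} escape; the variable names are made up for illustration:

```python
import unicodedata

# Build the character from its integer code point.
whole_note = chr(0x1D15D)

# Or refer to it by its official Unicode name.
named = '\N{MUSICAL SYMBOL WHOLE NOTE}'

# Both spellings give the same one-character string as '\U0001D15D'.
print(whole_note == named == '\U0001D15D')
print(len(whole_note), hex(ord(whole_note)), unicodedata.name(whole_note))
```

On Python 3.3 and later the string has length 1, since strings there are sequences of full code points.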

Because you used a \uabcd escape, which consumes only four hex digits, you replaced a in abc with two characters: the code point U+1D15 (ᴕ, LATIN LETTER SMALL CAPITAL OU) followed by the ASCII character D. Using the 8-digit \U escape works:

>>> import re
>>> print(re.sub('a', '\U0001D15D', 'abc' ))
𝅝bc
>>> print(re.sub('a', u'\U0001D15D', 'abc' ).encode('unicode_escape'))
b'\\U0001d15dbc'

where again the U+1D15D codepoint could be displayed by your font as a placeholder glyph instead.
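To see the difference between the two escapes without relying on a font, you can inspect the code points directly. A quick sketch (the variable names are just for illustration):

```python
short_escape = '\u1D15D'      # parsed as U+1D15 followed by a literal 'D'
long_escape = '\U0001D15D'    # a single code point, U+1D15D

# List the code point of every character in each string.
short_codepoints = [hex(ord(ch)) for ch in short_escape]
long_codepoints = [hex(ord(ch)) for ch in long_escape]

print(len(short_escape), short_codepoints)   # 2 ['0x1d15', '0x44']
print(len(long_escape), long_codepoints)     # 1 ['0x1d15d']
```

The lengths shown assume Python 3.3 or later, where a code point beyond U+FFFF counts as one character.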

Martijn Pieters answered Nov 15 '22 22:11


By the way, you do not need the re module for this. You could use str.translate:

>>> 'abc'.translate({ord('a'):'\U0001D15D'})
'𝅝bc'
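For a single fixed substitution like this one, plain str.replace should also do, and str.maketrans can build the translation table for you. A minimal sketch, with variable names chosen only for illustration:

```python
whole_note = '\U0001D15D'

# Simple one-to-one substitution without re or a translation table.
replaced = 'abc'.replace('a', whole_note)

# str.maketrans accepts single-character keys and converts them to ordinals.
table = str.maketrans({'a': whole_note})
translated = 'abc'.translate(table)

# ascii() shows the escape form, so the result does not depend on your font.
print(ascii(replaced))
print(replaced == translated)
```

str.translate tends to pay off when several characters have to be mapped in one pass; for one character, str.replace is probably the simplest choice.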
unutbu answered Nov 15 '22 22:11