I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:
u'that\u2019s \U0001f63b'
The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.
When I print the text, character by character, using:
s = u'that\u2019s \U0001f63b'
for c in s:
    print c.encode('unicode_escape')
The program produces the following output:
t
h
a
t
\u2019
s
\ud83d
\ude3b
How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?
Emojis can also be produced with the third-party emoji module for Python, which has to be installed first. Its emojize() function takes the CLDR short name of the emoji as its parameter.
Emojis are also Unicode characters: 😄 is code point 128516 (U+1F604).
To write any character in a Python string literal, use \u followed by the four hex digits of its code point, or \U followed by eight hex digits for code points above U+FFFF.
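As a small illustration of the two escape forms, here is a sketch meant to run on both Python 2.7 and Python 3.3+. The variable names are made up for the example, and the escapes are printed as ASCII so the result does not depend on what the terminal can display.

```python
# Written to run on both Python 2.7 and Python 3.3+.
apostrophe = u'\u2019'    # \u takes exactly 4 hex digits (up to U+FFFF)
cat = u'\U0001f63b'       # \U takes exactly 8 hex digits (any code point)

# 128516 in decimal is the same code point as 0x1F604 in hex.
smile = u'\U0001f604'
print(hex(128516))        # 0x1f604

# Show each string as an ASCII escape instead of the glyph itself.
for ch in (apostrophe, cat, smile):
    print(ch.encode('unicode_escape').decode('ascii'))
```

The leading zeros in \U0001f63b are just padding to eight digits; the code point itself is 1F63B.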
Because emoji characters are treated as pictographs, they are encoded in Unicode based primarily on their general appearance, not on an intended semantic. The meaning of each emoji can vary depending on language, culture, context, and may change or be repurposed by various groups over time.
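I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4-digit and one 8-digit escape sequence. The emoji's code point is U+1F63B, not U+F63B; it does not fit in four hex digits, so it is written in the eight-digit \U form, and that is where the leading 0001 comes from. Try this in the REPL on, say, OS X: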
>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻
In Python 3, though:
Python 3.4.3 (default, Jul 7 2015, 15:40:07)
>>> s = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
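The \ud83d and \ude3b in the question's output are a UTF-16 surrogate pair. That is how a "narrow" Python 2 build (one where sys.maxunicode is 0xFFFF) stores a code point above U+FFFF. If you need to walk the string by code point on such a build, one possible approach is to join the pairs yourself. The helper names below are hypothetical, and this is a sketch rather than the only way to do it:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair (two ints) into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def code_points(text):
    """Yield integer code points, joining surrogate pairs when present."""
    units = [ord(ch) for ch in text]
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # High surrogate followed by a low surrogate: one code point.
            yield combine_surrogates(unit, units[i + 1])
            i += 2
        else:
            yield unit
            i += 1

s = u'that\u2019s \U0001f63b'
print([hex(cp) for cp in code_points(s)])
print(hex(combine_surrogates(0xD83D, 0xDE3B)))  # 0x1f63b
```

On Python 3, and on wide Python 2 builds, the emoji is already a single character, so the generator simply passes every code point through unchanged.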