I am trying to search for emoticons in python strings. So I have, for example,
em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 😀😀🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")
yes, the emoticon is there
and if I search for em_test in the test string above, I can actually find it.
So I have made a csv file with all the emoticons I want defined by their unicode. The CSV looks like this:
\U0001F600
\U0001F601
\U0001F602
\U0001F923
and when I import it and print it, I actually do not get the emoticons but rather just their text representation:
['\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
...
]
and hence I cannot use this to search for these emoticons in another string... I know that the double backslash \\ is just the representation of a single backslash, but somehow the unicode reader does not get it. I do not know what I'm missing.
Any suggestions?
You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.
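As a minimal sketch of the text-mode case (the sample list below stands in for whatever csv.reader hands you):

```python
# Escape sequences read from a text-mode file arrive as str, not bytes,
# so we encode to bytes first and then undo the escapes.
raw = ['\\U0001F600', '\\U0001F601']  # literal backslash-U text, as read

emojis = [s.encode('ascii').decode('unicode-escape') for s in raw]
print(emojis)  # ['😀', '😁']
```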
Just for fun, I'll also use unicodedata to get the names of those emojis.
import unicodedata as ud
emojis = [
'\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
]
for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)
output
\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣
This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster, since it avoids the initial decoding step while reading the file and allows you to eliminate the .encode('ASCII') call.
You can make the decoding a little more robust by using u.encode('Latin1').decode('unicode-escape'), but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better to open the file in binary mode to avoid the need to encode at all.
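Here is a sketch of that binary-mode variant; the file name "emojis.csv" is an assumption, and the first block just creates sample data for the demo:

```python
# Create a sample file containing literal \U... escape sequences.
with open('emojis.csv', 'wb') as f:
    f.write(b'\\U0001F600\n\\U0001F923\n')

# Binary mode: each line is already bytes, so no .encode() step is needed.
with open('emojis.csv', 'rb') as f:
    emojis = [line.strip().decode('unicode-escape') for line in f]

print(emojis)  # ['😀', '🤣']
```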
1. Keeping your CSV as-is:
It's a somewhat heavyweight solution, but using ast.literal_eval works:
import ast
s = '\\U0001F600'
x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)
I get 0x1f600 (which is the correct character code) and the emoticon itself (😀). Just to be clear, the strange character I originally pasted here was a console rendering issue on my end; the decoding itself works.
Just surround the value with quotes so that ast.literal_eval takes the input as a string literal.
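Applied to the whole file, that looks roughly like this (the file name "emojis.csv" is an assumption, and the first block writes sample data for the demo):

```python
import ast
import csv

# Sample file: each row holds a literal \U... escape sequence.
with open('emojis.csv', 'w') as f:
    f.write('\\U0001F600\n\\U0001F602\n')

# Wrap each cell in quotes so literal_eval parses it as a string literal.
with open('emojis.csv', newline='') as f:
    emojis = [ast.literal_eval('"{}"'.format(row[0])) for row in csv.reader(f)]

print(emojis)  # ['😀', '😂']
```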
2. Using character codes directly:
Maybe you'd be better off storing the character codes themselves instead of the \U escape format:
print(chr(0x1F600))
does exactly the same thing (so ast is slight overkill).
your csv could contain:
0x1F600
0x1F601
0x1F602
0x1F923
then chr(int(row[0], 16)) would do the trick when reading it. For example, with one code per row:
import csv

with open("codes.csv", newline="") as f:
    cr = csv.reader(f)
    # chr() turns each parsed hex code into the actual emoji character
    codes = [chr(int(row[0], 16)) for row in cr]
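To close the loop on the original goal, here is an end-to-end sketch that searches a string with the decoded list; "codes.csv" and the test string are assumptions, and the first block writes sample rows for the demo:

```python
import csv

# Sample file: one hex character code per row.
with open('codes.csv', 'w') as f:
    f.write('0x1F600\n0x1F923\n')

with open('codes.csv', newline='') as f:
    emojis = [chr(int(row[0], 16)) for row in csv.reader(f)]

# Find which of the listed emojis appear in a given string.
test = 'This is a test string \U0001F923'
found = [e for e in emojis if e in test]
print(found)  # ['🤣']
```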