
How to properly print a list of unicode characters in python?

I am trying to search for emoticons in python strings. So I have, for example,

em_test = ['\U0001f680']
print(em_test)
['πŸš€']
test = 'This is a test string πŸ’°πŸ’°πŸš€'
if any(x in test for x in em_test):
    print ("yes, the emoticon is there")
else: 
    print ("no, the emoticon is not there")

yes, the emoticon is there

and if I search for em_test in

'This is a test string πŸ’°πŸ’°πŸš€'

I can actually find it.

So I have made a CSV file with all the emoticons I want, defined by their Unicode escape sequences. The CSV looks like this:

\U0001F600

\U0001F601

\U0001F602

\U0001F923

and when I import it and print it I actually do not get the emoticons but rather just their text representation:

['\\U0001F600',
 '\\U0001F601',
 '\\U0001F602',
 '\\U0001F923',
...
]

and hence I cannot use this to search for these emoticons in another string... I know that the double backslash \\ is just the representation of a single backslash, but somehow the decoding never happens, and I do not know what I'm missing.
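For illustration, the escaped text and the decoded character really are different values (a minimal example, not from the CSV itself):

```python
# '\\U0001F600' is ten characters of plain text (a literal backslash
# followed by "U0001F600"); '\U0001F600' is the single decoded character.
literal = '\\U0001F600'
emoji = '\U0001F600'
print(len(literal), len(emoji))  # 10 1
print(literal == emoji)          # False
```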

Any suggestions?

asked Nov 13 '17 by Bullzeye

2 Answers

You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.

Just for fun, I'll also use unicodedata to get the names of those emojis.

import unicodedata as ud

emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)

output

\U0001F600 GRINNING FACE πŸ˜€
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY πŸ˜‚
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🀣

This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.

You can make the decoding a little more robust by using

u.encode('Latin1').decode('unicode-escape')

but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.

answered Oct 11 '22 by PM 2Ring


1. keeping your csv as-is:

it's a bloated solution, but using ast.literal_eval works:

import ast

s = '\\U0001F600'

x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)

I get 0x1f600 (which is the correct character code) and the emoticon character (😀). (Well, I had to copy/paste the character from my console into this answer field, but that's a console issue on my end; otherwise it works.)

Just surround the text with quotes so that ast evaluates the input as a string literal.

2. using character codes directly

maybe you'd be better off by storing the character codes themselves instead of the \U format:

print(chr(0x1F600))

does exactly the same (so ast is slightly overkill)

your csv could contain:

0x1F600
0x1F601
0x1F602
0x1F923

then chr(int(row[0], 16)) would do the trick when reading it. For example, if each row of the CSV holds one code:

import csv

with open("codes.csv") as f:
    cr = csv.reader(f)
    codes = [int(row[0], 16) for row in cr]
answered Oct 11 '22 by Jean-François Fabre