Removing emoji from a string doesn't work in some cases

I am working with data received from Google BigQuery that contains emoji. I have code that removes emoji, but it does not work for the specific emoji shown below.

Sample code that removes most emoji, but not in the case below:

Using Python 3.9.

from re import UNICODE, compile
emoji_pattern = compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U0001F1F2-\U0001F1F4"  # Macau flag
                        u"\U0001F1E6-\U0001F1FF"  # flags
                        u"\U0001F600-\U0001F64F"
                        u"\U00002702-\U000027B0"
                        u"\U000024C2-\U0001F251"
                        u"\U0001f926-\U0001f937"
                        u"\U0001F1F2"
                        u"\U0001F1F4"
                        u"\U0001F620"
                        u"\u200d"
                        u"\u2640-\u2642"
                        "]+", flags=UNICODE)

# Works for this one 
data = 'support.google.co.uk/s/.💻'
result = emoji_pattern.subn(r'', data)
# result --> ('support.google.co.uk/s/.', 1)

# Doesn't work in this case
data = 'www.google.co.uk/?🤣'
result = emoji_pattern.subn(r'', data)
# result --> ('www.google.co.uk/?🤣', 0)

Can someone help me with this case? It would also be very helpful to know how to check the Unicode code point of 🤣 (or any special character or emoji) in Python 3.9, so that I can add such code points to the emoji pattern.

Binit Amin asked Sep 05 '25 04:09


2 Answers

Check out this answer; the emoji Python package seems like the best way to solve this problem.

To see the Unicode escape (code point) of any emoji or character, do this:

s = '🤣'
# unicode-escape renders non-ASCII characters as \uXXXX or \UXXXXXXXX escapes
print(s.encode('unicode-escape').decode('ASCII'))

It prints \U0001f923
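
As an alternative to the escape trick, here is a minimal sketch that uses only the standard library: ord gives the code point and unicodedata.name gives the official character name. The helper name describe is made up for illustration and is not part of any library.

```python
import unicodedata


def describe(char: str) -> str:
    """Return the code point and Unicode name of a single character."""
    code_point = ord(char)
    # Fall back to a placeholder if the character has no assigned name.
    name = unicodedata.name(char, "<unnamed>")
    return f"U+{code_point:04X} {name}"


print(describe("🤣"))  # U+1F923 ROLLING ON THE FLOOR LAUGHING
print(describe("💻"))  # U+1F4BB PERSONAL COMPUTER
```

U+1F923 falls just below the \U0001f926-\U0001f937 range in the question's pattern and above \U0001F251, which would explain why 🤣 slipped through.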

Rama Salahat answered Sep 07 '25 21:09


Modified emoji pattern, just for reference:

from re import UNICODE, compile

emoji_pattern = compile("["
                        u"\U0001F600-\U0001F64F"  # emoticons
                        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                        u"\U0001F680-\U0001F6FF"  # transport & map symbols
                        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                        u"\U00002702-\U000027B0"  # dingbats
                        u"\U000024C2-\U0001F251"  # very wide: also spans CJK and other non-emoji text
                        u"\U0001f926-\U0001f937"
                        u"\U0001F1F2"
                        u"\U0001F1F4"
                        u"\U0001F620"
                        u"\u200d"  # zero width joiner
                        u"\u2640-\u2642"
                        u"\u2600-\u2B55"
                        u"\u23cf"
                        u"\u23e9"
                        u"\u231a"
                        u"\ufe0f"  # variation selector-16
                        u"\u3030"
                        u"\U00002500-\U00002BEF"  # box drawing through misc. symbols and arrows
                        u"\U00010000-\U0010ffff"  # all supplementary planes (includes 🤣, U+1F923)
                        "]+", flags=UNICODE)
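
One caveat worth checking before relying on a list like this: the two widest ranges decide most of the behaviour. Below is a rough sketch, where broad_pattern and strip_emoji are hypothetical names and the pattern is cut down to just those two ranges for illustration. It shows 🤣 being removed, but ordinary CJK text being removed as well.

```python
import re

# Hypothetical reduced pattern: only the two widest ranges from the list.
broad_pattern = re.compile(
    "["
    "\U000024C2-\U0001F251"   # spans many BMP blocks, including CJK ideographs
    "\U00010000-\U0010FFFF"   # every code point outside the BMP
    "]+"
)


def strip_emoji(text: str) -> str:
    """Remove every run of characters matched by broad_pattern."""
    return broad_pattern.sub("", text)


print(strip_emoji("www.google.co.uk/?🤣"))        # www.google.co.uk/?
print(strip_emoji("support.google.co.uk/s/.💻"))  # support.google.co.uk/s/.
print(repr(strip_emoji("日本語 ok")))              # ' ok' (the CJK text is gone too)
```

If non-emoji text in other scripts must survive, narrower ranges or the emoji package mentioned in the other answer may be a better fit.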

Thank you

Binit Amin answered Sep 07 '25 21:09