I found this code in Python for removing emojis, but it is not working. Can you help with other code or a fix for this?
I have observed that all my emojis start with \xf,
but when I try to search with str.startswith("\xf"),
I get an invalid character error.
emoji_pattern = r'/[x{1F601}-x{1F64F}]/u'
re.sub(emoji_pattern, '', word)
Here's the error:
Traceback (most recent call last):
File "test.py", line 52, in <module>
re.sub(emoji_pattern,'',word)
File "/usr/lib/python2.7/re.py", line 151, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "/usr/lib/python2.7/re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: bad character range
Each of the items in a list can be a word ['This', 'dog', '\xf0\x9f\x98\x82', 'https://t.co/5N86jYipOI']
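For reference, the traceback can be reproduced in isolation. This is a minimal Python 3 sketch (the `failed` and `message` names are only illustrative): inside a character class Python's `re` has no `x{...}` escape, so `}-x` is read as a range from `}` to `x`, and that range runs backwards.

```python
import re

failed = False
message = ""
try:
    # The PHP-style pattern from the question, unchanged.
    re.compile(r'/[x{1F601}-x{1F64F}]/u')
except re.error as exc:
    # '}' (0x7D) sorts after 'x' (0x78), so "}-x" is not a valid range.
    failed = True
    message = str(exc)

print(failed, message)
```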
UPDATE: I used this other code:
emoji_pattern=re.compile(ur" " " [\U0001F600-\U0001F64F] # emoticons \
|\
[\U0001F300-\U0001F5FF] # symbols & pictographs\
|\
[\U0001F680-\U0001F6FF] # transport & map symbols\
|\
[\U0001F1E0-\U0001F1FF] # flags (iOS)\
" " ", re.VERBOSE)
emoji_pattern.sub('', word)
But this still doesn't remove the emojis; it still shows them! Any clue why that is?
To remove the emojis, we set the parameter no_emoji to True.
To remove emojis from a string in Python, we can create a regex that matches a list of emojis: call re.compile with the pattern set to a string that matches the character code ranges for emojis. For example, \U0001F600-\U0001F64F is the code range for emoticons.
Using str.replace(), we can replace a specific character. To remove that character, replace it with an empty string. The str.replace() method replaces all occurrences of the given character.
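A minimal sketch of the str.replace() approach, assuming Python 3 and a single emoji that is known in advance (the sample sentence is made up for illustration):

```python
# Remove one known emoji with str.replace (no regex involved).
text = "This dog \U0001F602 is funny \U0001F602"

# Replacing with an empty string deletes every occurrence of that character.
cleaned = text.replace("\U0001F602", "")

# The spaces that surrounded each emoji are left in place.
print(repr(cleaned))
```

This only helps when the exact characters to drop are known; for arbitrary emojis a pattern over code point ranges is the more practical route.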
Emojis can also be inserted using the emoji module, a third-party package that has to be installed first. Its emojize() function takes the CLDR short name as a parameter and returns the corresponding emoji.
On Python 2, you have to use a u'' literal to create a Unicode string. Also, you should pass the re.UNICODE flag and convert your input data to Unicode (e.g., text = data.decode('utf-8')):
#!/usr/bin/env python
import re
text = u'This dog \U0001f602'
print(text) # with emoji
emoji_pattern = re.compile("["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"]+", flags=re.UNICODE)
print(emoji_pattern.sub(r'', text)) # no emoji
This dog 😂
This dog
Note: emoji_pattern matches only some emoji (not all). See Which Characters are Emoji.
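Applied to the list from the question, the same idea might look like the sketch below. It assumes Python 3, where the items would be byte strings that need decoding first; the names raw_words and words are only illustrative.

```python
import re

# Same four ranges as in the answer; this matches only some emoji.
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+"
)

# The list from the question, written as Python 3 byte strings.
raw_words = [b"This", b"dog", b"\xf0\x9f\x98\x82", b"https://t.co/5N86jYipOI"]

# Decode each item to text first, then strip the emoji.
words = [emoji_pattern.sub("", w.decode("utf-8")) for w in raw_words]

# Drop items that became empty once the emoji was removed.
words = [w for w in words if w]
print(words)
```

Decoding before matching is the key step: the pattern is defined over code points, so it cannot match the four raw UTF-8 bytes.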
Complete Version of remove Emojis
import re
def remove_emojis(data):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows (not Chinese characters)
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also spans CJK, Hangul and most of the BMP
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # every character outside the BMP
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero-width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
        "]+", re.UNICODE)
    return re.sub(emoj, '', data)
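One caveat worth checking before using this version: the range \U000024C2-\U0001F251 covers far more than emoji. A small Python 3 sketch, isolating just that range (the names broad and stripped are only for illustration):

```python
import re

# Only the one range under test, copied from the character class above.
broad = re.compile("[\U000024C2-\U0001F251]+")

# Latin text followed by two CJK characters (U+4E16, U+754C); no emoji at all.
sample = "Hello \u4e16\u754c"
stripped = broad.sub("", sample)

# The CJK characters fall inside the range, so they are removed as well.
print(repr(stripped))
```

If the input can contain Chinese, Japanese or Korean text, narrower ranges are the safer choice.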
I am updating my answer to match the one by @jfs, because my previous answer failed to account for other Unicode scripts such as Latin, Greek, etc. Stack Overflow doesn't allow me to delete my previous answer, hence I am updating it to match the most accepted answer to the question.
#!/usr/bin/env python
import re
text = u'This is a smiley face \U0001f602'
print(text) # with emoji
def deEmojify(text):
    regrex_pattern = re.compile(pattern = "["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+", flags = re.UNICODE)
    return regrex_pattern.sub(r'',text)
print(deEmojify(text))
This was my previous answer, do not use this.
def deEmojify(inputString):
    return inputString.encode('ascii', 'ignore').decode('ascii')
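To see why the ASCII round-trip is discouraged, here is a short Python 3 sketch with a made-up sample string containing an accented Latin letter, two Greek letters and one emoji:

```python
def deEmojify(inputString):
    # Drops every non-ASCII character, not just emoji.
    return inputString.encode('ascii', 'ignore').decode('ascii')

# "café αβ" followed by a space and U+1F602.
sample = "caf\u00e9 \u03b1\u03b2 \U0001F602"
result = deEmojify(sample)

# The emoji is gone, but so are the accented letter and the Greek letters.
print(repr(result))
```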
If you are not keen on using regex, the best solution could be using the emoji python package.
Here is a simple function to return emoji free text (thanks to this SO answer):
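import emoji
# Assumes Python 2 (text is a UTF-8 byte string) and an older release of the
# emoji package; newer releases no longer provide emoji.UNICODE_EMOJI.
def give_emoji_free_text(text):
    allchars = [str for str in text.decode('utf-8')]
    emoji_list = [c for c in allchars if c in emoji.UNICODE_EMOJI]
    # Drops every whitespace-separated word that contains at least one emoji.
    clean_text = ' '.join([str for str in text.decode('utf-8').split() if not any(i in str for i in emoji_list)])
    return clean_text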
If you are dealing with strings containing emojis, this is straightforward
>> s1 = "Hi š¤ How is your š and š. Have a nice weekend ššš"
>> print s1
Hi š¤ How is your š and š. Have a nice weekend ššš
>> print give_emoji_free_text(s1)
Hi How is your and Have a nice weekend
If you are dealing with unicode (as in the example by @jfs), just encode it with utf-8.
>> s2 = u'This dog \U0001f602'
>> print s2
This dog 😂
>> print give_emoji_free_text(s2.encode('utf8'))
This dog
Edits
Based on the comment, it should be as easy as:
def give_emoji_free_text(text):
    return emoji.get_emoji_regexp().sub(r'', text.decode('utf8'))
Complete version of remove emojis:
import re
def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)