Replace all emojis from a given unicode string

I have a list of Unicode symbols from the emoji package. My end goal is to create a function that takes a Unicode string as input, e.g. some👩😌thing, and removes all emojis from it, giving "something". Below is a demonstration of what I want to achieve:

from emoji import UNICODE_EMOJI
text = 'some👩😌thing'
exclude_list = UNICODE_EMOJI.keys()
output = ... = 'something'

While trying to do this, I came across some strange behavior, demonstrated below. I believe that if the code below is fixed, I will be able to achieve my end goal.

import regex as re
print u'\U0001F469'                     # 👩   
print u'\U0001F60C'                     # 😌    
print u'\U0001F469\U0001F60C'           # 👩😌 

text = u'some\U0001F469\U0001F60Cthing' 
print text                              # some👩😌thing

# Removing "👩😌" works
print re.sub(ur'[\U0001f469\U0001F60C]+', u'', text)  # something
# Removing only "👩" doesn't work 
print re.sub(ur'[\U0001f469]+', u'', text)            # some�thing
dimitris93 asked Nov 19 '18 22:11



2 Answers

In many builds of Python 2.7 (the so-called narrow builds), Unicode codepoints above U+FFFF are stored as a surrogate pair, meaning Python actually sees them as two characters. You can prove this to yourself with len(u'\U0001F469'), which returns 2 on such a build.
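
The split into a surrogate pair follows the UTF-16 encoding rule and can be worked out by hand. The sketch below is written for Python 3 purely as an illustration; the helper name `to_surrogate_pair` is made up for this example and is not part of any library.

```python
def to_surrogate_pair(ch):
    """Return the (high, low) UTF-16 surrogates for a character above U+FFFF."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character fits in a single 16-bit unit")
    offset = cp - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low

woman = to_surrogate_pair('\U0001F469')
relieved = to_surrogate_pair('\U0001F60C')
print([hex(u) for u in woman])
print([hex(u) for u in relieved])
```

Both emoji from the question share the same high surrogate (0xD83D) and differ only in the low one, which is relevant to the character-class behavior seen in the question.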

The best way to solve this is to move to a version of Python that treats those codepoints as a single entity rather than a surrogate pair. You can compile Python 2.7 for this (a wide build, configured with --enable-unicode=ucs4), and Python 3.3 and later do it automatically.
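
As a quick check, assuming Python 3.3 or later and the standard-library `re` module (rather than the `regex` package used in the question), the character class from the question behaves as expected, because each emoji is a single character:

```python
import re

text = 'some\U0001F469\U0001F60Cthing'

# Each emoji is one code point, so the string has 4 + 2 + 5 = 11 characters.
length = len(text)

# Removing only U+1F469 leaves U+1F60C intact.
only_woman_removed = re.sub('[\U0001F469]+', '', text)

# Removing both leaves plain ASCII.
both_removed = re.sub('[\U0001F469\U0001F60C]+', '', text)

print(length, ascii(only_woman_removed), both_removed)
```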

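To create a regular expression to use for the replacement, join all the characters together with |, escaping each one in case it contains a regex metacharacter. Since the characters in the list are already encoded as surrogate pairs, the result is the proper pattern.

# Escape each entry so metacharacters (e.g. the "*" in a keycap emoji) are taken literally
subs = u'|'.join(re.escape(e) for e in exclude_list)
print re.sub(subs, u'', text)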
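
The same approach packaged as a function, sketched for Python 3 with the standard `re` module. The three-entry `exclude_list` below is a stand-in so the example runs without the emoji package, and `remove_emojis` is a hypothetical helper name. Each entry is escaped so that regex metacharacters are taken literally, and the entries are sorted longest-first so that multi-character sequences match before their own prefixes.

```python
import re

# Stand-in for UNICODE_EMOJI.keys(); assumed values for illustration only.
exclude_list = ['\U0001F469', '\U0001F60C', '*\uFE0F\u20E3']

def remove_emojis(text, emojis):
    # Longest entries first so sequences win over their own prefixes.
    ordered = sorted(emojis, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(e) for e in ordered))
    return pattern.sub('', text)

cleaned = remove_emojis('some\U0001F469\U0001F60Cthing', exclude_list)
keycap_cleaned = remove_emojis('press *\uFE0F\u20E3 now', exclude_list)
print(cleaned)
print(keycap_cleaned)
```

Without the escaping, the entry that starts with `*` would be read as a quantifier rather than as a literal character.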
Mark Ransom answered Oct 05 '22 00:10



The old 2.7 regex engine gets confused because:

  1. Narrow builds of Python 2.7 store Unicode strings in 16-bit units, so codepoints above U+FFFF are automatically replaced by surrogate pairs.

  2. Before the regex "sees" your Python string, Python has already split each large Unicode codepoint into two separate code units (each one valid on its own, but only half of a character).

  3. That means that [\U0001f469]+ is a character class of 2 code units. The first of them (the high surrogate) also occurs in the other emoji in your string, while the second does not, so only half of 😌 is removed. That leads to your badly formed output.
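
The failure can be reproduced on a modern interpreter by looking at UTF-16 code units directly. The following is a rough Python 3 simulation of the effect, not what the 2.7 regex engine literally executes; `utf16_units` is a helper invented for this sketch.

```python
import struct

def utf16_units(s):
    """Return the list of 16-bit UTF-16 code units for a string."""
    data = s.encode('utf-16-le')
    return list(struct.unpack('<%dH' % (len(data) // 2), data))

text = 'some\U0001F469\U0001F60Cthing'
units = utf16_units(text)

# On a narrow build, [\U0001f469] is a class of two code units.
char_class = set(utf16_units('\U0001F469'))   # 0xD83D and 0xDC69

# Dropping every unit in the class mimics what the substitution does here.
remaining = [u for u in units if u not in char_class]
print([hex(u) for u in remaining])
```

The high surrogate 0xD83D is shared by both emoji, so it is removed from both. The low surrogate 0xDE0C of 😌 is left on its own, which is the � in the output.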

This fixes it:

print re.sub(ur'(\U0001f469|\U0001F60C)+', u'', text)  # something
# Removing only "👩" now works as well
print re.sub(ur'(\U0001f469)+', u'', text)            # some😌thing

because now the regex engine sees the exact same sequence of characters – surrogate pairs or otherwise – that you are looking for.

If you want to remove all emoji from the exclude_list, you can explicitly loop over its contents and replace one by one:

exclude_list = UNICODE_EMOJI.keys()

for bad in exclude_list:  # or simply "for bad in UNICODE_EMOJI" if you gotta catch them all
    if bad in text:
        print 'Removing '+bad
        text = text.replace(bad, u'')
print text

Output:

Removing 👩
Removing 😌
something

(This also shows the intermediate results as proof it works; you only need the replace line in the loop.)
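
For reference, the same loop as a reusable function, sketched for Python 3. `strip_emojis` and the sample list are assumptions for illustration; the real list would come from the emoji package. Sorting longest-first is an addition to the loop above: it ensures a multi-codepoint sequence is removed as a whole before any of its single-codepoint parts.

```python
def strip_emojis(text, emojis):
    # Remove longer sequences first, then single code points.
    for bad in sorted(emojis, key=len, reverse=True):
        text = text.replace(bad, '')
    return text

# Assumed sample entries; U+1F469 U+200D U+1F4BB is a ZWJ sequence.
sample = ['\U0001F469', '\U0001F4BB', '\U0001F469\u200D\U0001F4BB', '\U0001F60C']

result = strip_emojis('some\U0001F469\u200D\U0001F4BB\U0001F60Cthing', sample)
print(result)
```

If the single code points were removed first, the zero-width joiner (U+200D) in the middle of the sequence would be left behind in the text.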

Jongware answered Oct 04 '22 23:10
