Let's say we have the following strings containing emojis:
sent1 = 'š š right'
sent2 = 'Some text?! ššššš'
sent3 = 'š'
The task is to remove text and get the following output:
sent1_emojis = 'š š '
sent2_emojis = ' ššššš'
sent3_emojis = 'š'
Based on a past question (Regex Emoji Unicode), I use the following regex to identify strings that contain at least one emoji:
emoji_pattern = re.compile(u".*(["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
"])+", flags= re.UNICODE)
In order to get the output string I use the following:
re.match(emoji_pattern, sent1).group(0)
and so on.
There's a problem with the sent2 string: re.match(emoji_pattern, sent2).group(0) returns the whole of sent2 instead of the emojis only.
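A minimal sketch of why this happens, assuming Python 3 and a stand-in emoji (U+1F64C, chosen only for illustration, not taken from the original strings). The leading .* is part of the match, so group(0) includes all the text it consumed, while group(1) only holds the last repetition of the capturing group. The names EMOJI_CLASS, with_prefix and sample are hypothetical.

```python
import re

EMOJI_CLASS = (
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]"
)

# Same shape as the pattern in the question: ".*" followed by a repeated group.
with_prefix = re.compile(".*(" + EMOJI_CLASS + ")+")

# Stand-in input (assumption): plain text followed by two U+1F64C characters.
sample = "Some text?! \U0001F64C\U0001F64C"

m = with_prefix.match(sample)
whole_match = m.group(0)   # everything the pattern consumed, ".*" included
last_capture = m.group(1)  # only the final repetition of the group

print(whole_match == sample)              # the text was not stripped
print(last_capture == "\U0001F64C")       # a single emoji, not the full run
```

So neither group gives the run of emojis on its own, which is why dropping the .* and collecting every match is the more direct route.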
A small change to emoji_pattern will do the job:
import re

emoji_pattern = re.compile(u"(["  # .* removed
u"\U0001F600-\U0001F64F"  # emoticons
u"\U0001F300-\U0001F5FF"  # symbols & pictographs
u"\U0001F680-\U0001F6FF"  # transport & map symbols
u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
"])", flags=re.UNICODE)  # + removed
# findall returns each matched emoji; join them back into one string
for sent in [sent1, sent2, sent3]:
    print(''.join(re.findall(emoji_pattern, sent)))
šš
ššššš
š
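Note that the joined findall output contains no spaces, whereas the expected strings in the question keep them. If the spaces matter, one possible variation (a sketch, not the answer's method) is to delete every character that is neither in the emoji ranges nor a space. The helper keep_emojis and the sample strings are hypothetical, and the stand-in emojis (U+1F600, U+1F648, U+1F64C) are assumptions. These four ranges also do not cover every emoji block.

```python
import re

EMOJI_RANGES = (
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
)

# Negated class: any character that is neither an emoji in the ranges nor a space.
non_emoji = re.compile("[^" + EMOJI_RANGES + " ]")


def keep_emojis(text):
    """Hypothetical helper: strip everything except emojis and spaces."""
    return non_emoji.sub("", text)


# Stand-in inputs (assumptions), shaped like sent1 and sent2 in the question.
sample1 = "\U0001F600 \U0001F648 right"
sample2 = "Some text?! \U0001F64C\U0001F64C"

print(repr(keep_emojis(sample1)))
print(repr(keep_emojis(sample2)))
```

Every space is kept, so the second sample comes back with two leading spaces (one after "Some", one after "text?!"); collapse or strip them afterwards if a single space is wanted.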