
How to extract all the emojis from text?

Consider the following list:

a_list = ['🤔 🙈 me así, bla es se 😌 ds 💕👭👙']

How can I extract all the emojis inside a_list into a new list?

new_lis = ['🤔 🙈 😌 💕 👭 👙']

I tried to use a regex, but I do not have all the possible emoji encodings.

tumbleweed asked Mar 31 '17 17:03



2 Answers

You can use the emoji library. You can check if a single codepoint is an emoji codepoint by checking if it is contained in emoji.UNICODE_EMOJI.

import emoji

def extract_emojis(s):
  return ''.join(c for c in s if c in emoji.UNICODE_EMOJI['en'])

Pedro Castilho answered Oct 26 '22 05:10


I think it's important to point out that the previous answers won't work with emojis like 👨‍👩‍👦‍👦, because it consists of four emojis joined by zero-width joiners, and a per-character check with ... in emoji.UNICODE_EMOJI will return four separate emojis. The same goes for emojis with a skin tone modifier, like 🙅🏽.
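To see why a per-character check splits such an emoji, it can help to list its code points. The following is a minimal standard-library sketch; the variable names family and names are only illustrative.

```python
import unicodedata

# The family emoji, written with escapes so every code point is explicit.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F466\u200D\U0001F466"

# One entry per code point: four people joined by three ZERO WIDTH JOINERs.
names = [unicodedata.name(ch) for ch in family]
for ch, name in zip(family, names):
    print(f"U+{ord(ch):04X} {name}")

# Number of code points, not the number of glyphs a reader sees.
print(len(family))
```

Iterating over the string character by character visits each of these code points separately, which is why the single glyph comes back in pieces.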

My solution

Include the emoji and regex modules. The regex module supports recognizing grapheme clusters (sequences of Unicode codepoints rendered as a single character), so we can count an emoji like 👨‍👩‍👦‍👦 as one item.

import emoji
import regex

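def split_count(text):
    """Return the emoji grapheme clusters found in text, in order."""
    emoji_list = []
    # \X matches one extended grapheme cluster (regex module, not the stdlib re).
    data = regex.findall(r'\X', text)
    for word in data:
        # Keep the cluster if any of its code points is a known emoji.
        if any(char in emoji.UNICODE_EMOJI['en'] for char in word):
            emoji_list.append(word)
    return emoji_list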

Testing

with more emojis with skin color:

line = ["🤔 🙈 me así, se 😌 ds 💕👭👙 hello 👩🏾‍🎓 emoji hello 👨‍👩‍👦‍👦 how are 😊 you today🙅🏽🙅🏽"]

counter = split_count(line[0])
print(' '.join(counter))

output:

🤔 🙈 😌 💕 👭 👙 👩🏾‍🎓 👨‍👩‍👦‍👦 😊 🙅🏽 🙅🏽
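If the third-party regex module is not available, a rough approximation with the standard re module might look like the sketch below. This is not the approach used above: it relies on hand-picked code point ranges, which is an assumption on my part (they miss some emojis and include some non-emoji symbols), and the names EMOJI_SEQ and extract_emoji_sequences are hypothetical.

```python
import re

# Assumption: hand-picked ranges, NOT the full Unicode emoji set.
_BASE = "[\U0001F300-\U0001FAFF\u2600-\u27BF]"
_TONE = "[\U0001F3FB-\U0001F3FF]"  # skin tone modifiers
# A unit is a base, an optional variation selector, and an optional tone.
_UNIT = f"{_BASE}\uFE0F?{_TONE}?"
# Units glued together by ZERO WIDTH JOINER count as one emoji.
EMOJI_SEQ = re.compile(f"{_UNIT}(?:\u200D{_UNIT})*")

def extract_emoji_sequences(text):
    """Return emoji sequences in order, keeping ZWJ and tone sequences whole."""
    return EMOJI_SEQ.findall(text)

# Thinking face, woman student with skin tone, two toned gestures back to back.
sample = ("hi \U0001F914 \U0001F469\U0001F3FE\u200D\U0001F393 ok "
          "\U0001F645\U0001F3FD\U0001F645\U0001F3FD")
found = extract_emoji_sequences(sample)
print(found)
```

For anything beyond a quick check, the grapheme-cluster approach with regex and the emoji data is likely the more reliable choice, since the ranges here are only a guess at what counts as an emoji.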

Include flags

If you want to include flags, like 🇵🇰, the Unicode range would be from 🇦 to 🇿, so add:

flags = regex.findall(u'[\U0001F1E6-\U0001F1FF]', text) 

to the function above, and return emoji_list + flags.

See this answer to "A python regex that matches the regional indicator character class" for more information about the flags.
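One detail worth checking: a character class on its own matches one regional indicator at a time, so a flag such as 🇵🇰 comes back as two items. A small sketch that pairs them up with the standard re module, assuming flags are always written as two consecutive regional indicators (FLAG and extract_flags are illustrative names):

```python
import re

# Two consecutive regional indicator symbols (U+1F1E6..U+1F1FF) form one flag.
FLAG = re.compile("[\U0001F1E6-\U0001F1FF]{2}")

def extract_flags(text):
    """Return each flag as a single two-code-point string."""
    return FLAG.findall(text)

# Regional indicators P+K and U+S, written as escapes.
text = "from \U0001F1F5\U0001F1F0 to \U0001F1FA\U0001F1F8"
flags = extract_flags(text)
print(flags)
```

Matching left to right in pairs also keeps adjacent flags apart, as long as the text contains no stray single indicator.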

For newer emoji versions

To work with emoji >= v1.2.0, you have to add a language specifier (e.g. en, as in the code above):

emoji.UNICODE_EMOJI['en']

sheldonzy answered Oct 26 '22 06:10