
Extract emoticons from a text

I need to extract text emoticons from a text using Python. I've been looking for solutions, but most of the ones I found only cover simple emoticons. I need to parse all of them.

Currently I keep a list of emoticons and iterate over it for every text I have to process, but this is very inefficient. Do you know a better solution? Maybe a Python library that can handle this problem?

Asked by David Moreno García, May 21 '15 at 10:05


1 Answer

One of the most efficient solutions is the Aho–Corasick string matching algorithm, a nontrivial algorithm designed for exactly this kind of problem: searching for multiple predefined strings in an unknown text. It scans the text once, regardless of how many strings are in the list.

There is a package available for this:
https://pypi.python.org/pypi/ahocorasick/0.9
https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Edit: There are also more recent packages available (I haven't tried any of them): https://pypi.python.org/pypi/pyahocorasick/1.0.0
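To make the idea concrete without depending on any of the packages above, here is a minimal pure-Python sketch of an Aho–Corasick matcher applied to emoticons. It is not the code of either package; the names `build_automaton` and `find_all` and the small emoticon list are made up for illustration, and a real project would probably prefer one of the tested libraries.

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (illustrative helper, not a library API)."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node to fall back to on a mismatch
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        node = 0
        for ch in pat:
            nxt = goto[node].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[node][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            node = nxt
        out[node].append(pat)

    # Breadth-first pass: children of the root keep fail = 0,
    # deeper nodes inherit the longest proper suffix that is also in the trie.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence, in a single pass."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# Hypothetical emoticon list; a real one would be much longer.
EMOTICONS = [":)", ":-)", ":D", ";)", ":'("]
automaton = build_automaton(EMOTICONS)
found = find_all("Nice :-) but sad :'( ok :D", automaton)
print([pat for _, pat in found])
```

The automaton is built once and reused for every text, so the cost per text depends on the text length and the number of matches rather than on the size of the emoticon list.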

Extra:
I ran a performance test with pyahocorasick, and it is faster than Python's re module as soon as you search for 2 or more words from the dictionary.

Here is the code:

import random
import re
import time

import ahocorasick  # provided by the pyahocorasick package

# Number of words to search for
N = 3

# Corpus from http://norvig.com/big.txt
with open("big.txt", "r") as f:
    text = f.read()

words = set(re.findall('[a-z]+', text.lower()))
search_words = random.sample(list(words), N)

# Build the Aho-Corasick automaton from the search words
A = ahocorasick.Automaton()
for i, w in enumerate(search_words):
    A.add_word(w, (i, w))
A.make_automaton()

# Time the Aho-Corasick search (overlapping matches are counted too)
start = time.time()
print("aho matches", sum(1 for _ in A.iter(text)))
print("aho done in", time.time() - start)

# Time the regex search; the words are purely alphabetic, so no escaping is needed.
# findall only returns non-overlapping matches, so the two counts can differ.
exp = re.compile('|'.join(search_words))
start = time.time()
m = exp.findall(text)
print("re matches", len(m))
print("re done in", time.time() - start)
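One caveat if the regex side of this comparison is reused for emoticons: the benchmark joins plain alphabetic words, but emoticons are full of regex metacharacters such as `)`, `(` and `^`. A rough sketch of the regex variant for emoticons, using only the standard library, might look like this. The helper name `compile_emoticon_regex` and the emoticon list are illustrative assumptions.

```python
import re


def compile_emoticon_regex(emoticons):
    """Compile one alternation pattern from literal emoticons (illustrative helper)."""
    # Longest first, so ":-))" is tried before its prefix ":-)".
    ordered = sorted(set(emoticons), key=len, reverse=True)
    # re.escape turns characters like ")" and "^" into literals.
    return re.compile("|".join(re.escape(e) for e in ordered))


# Hypothetical emoticon list for demonstration only.
EMOTICONS = [":)", ":-)", ":-))", ":(", ";)", ":D", "^_^"]
pattern = compile_emoticon_regex(EMOTICONS)
found = pattern.findall("Great :-)) thanks ^_^ see you :(")
print(found)
```

This keeps the code dependency-free, though per the timings above it would be expected to slow down relative to Aho–Corasick as the list of emoticons grows.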
Answered by Luka Rahne, Oct 10 '22 at 05:10