Extract emoticons from a text

Tags:

python

regex

text-processing

emoticons

I need to extract text emoticons from a text using Python. I've been looking for solutions, but most of them, like this or this, only cover simple emoticons. I need to parse all of them.

Currently I'm using a list of emoticons that I iterate over for every text I have to process, but this is very inefficient. Do you know a better solution? Maybe a Python library that can handle this problem?

asked May 21 '15 10:05

David Moreno García

1 Answer

One of the most efficient solutions is the Aho–Corasick string matching algorithm, a nontrivial algorithm designed for exactly this kind of problem (searching for multiple predefined strings in an unknown text).

There is a package available for this:
https://pypi.python.org/pypi/ahocorasick/0.9
https://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

Edit: There are also more recent packages available (I haven't tried any of them): https://pypi.python.org/pypi/pyahocorasick/1.0.0
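
To make the idea concrete, here is a minimal pure-Python sketch of Aho–Corasick applied to emoticons. It is not the code of any of the packages above; the function names `build_automaton` and `find_all` and the sample emoticon list are assumptions made for illustration. It builds a trie of the patterns, adds failure links with a breadth-first pass, and then scans the text once, reporting every match (including overlapping ones).

```python
from collections import deque


def build_automaton(patterns):
    """Build a trie with failure links (hypothetical helper, not a library API)."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # fail[state] -> state to fall back to on a mismatch
    out = [[]]    # out[state] -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)

    # Breadth-first pass: depth-1 states keep fail == 0, deeper states
    # inherit the longest proper suffix that is also a trie prefix.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            matches.append((i - len(pat) + 1, pat))
    return matches


# Sample emoticon list, assumed for illustration only.
emoticons = [":)", ":-)", ":D", ";)"]
automaton = build_automaton(emoticons)
found = find_all("Hi :-) see you :D ;)", automaton)
print(found)
```

The text is scanned only once regardless of how many emoticons are in the list, which is what makes this approach attractive compared with iterating over the list for every text. A C-backed package will be considerably faster than this sketch.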

Extra:
I ran a performance test with pyahocorasick, and it is faster than Python's re when searching for more than one word from the dict (2 or more).

Here is the code:

import random
import re
import time

import ahocorasick  # module provided by the pyahocorasick package

# Number of words to search for
N = 3

# Text file from http://norvig.com/big.txt
with open("big.txt", "r") as f:
    text = f.read()

# Pick N random words that occur in the text
words = set(re.findall('[a-z]+', text.lower()))
search_words = random.sample(list(words), N)

# Build the Aho-Corasick automaton
A = ahocorasick.Automaton()
for i, w in enumerate(search_words):
    A.add_word(w, (i, w))
A.make_automaton()

# Time Aho-Corasick (A.iter also reports overlapping matches)
start = time.time()
print("aho matches", sum(1 for _ in A.iter(text)))
print("aho done in ", time.time() - start)

# Time re (findall reports non-overlapping matches only, so counts may differ)
exp = re.compile('|'.join(search_words))
start = time.time()
m = exp.findall(text)
print("re matches", len(m))
print("re done in ", time.time() - start)
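
The regex side of the benchmark joins plain lowercase words, so it needs no escaping. Emoticons are full of regex metacharacters, so a regex-based variant for the original question would have to escape them. The following is a small sketch under that assumption; the helper name `compile_emoticon_pattern` and the sample emoticons are made up for illustration.

```python
import re


def compile_emoticon_pattern(emoticons):
    """Compile one alternation that matches any of the given emoticons."""
    # Longest first, so that ":-))" is tried before its prefix ":-)".
    ordered = sorted(set(emoticons), key=len, reverse=True)
    # re.escape neutralises metacharacters such as ")", "(", "^" and "|".
    return re.compile("|".join(re.escape(e) for e in ordered))


# Sample emoticon list, assumed for illustration only.
pattern = compile_emoticon_pattern([":)", ":-)", ":(", ":-))", "^_^"])
found = pattern.findall("great :-)) but ^_^ then :(")
print(found)
```

This stays within the standard library and is often good enough for a short list; for a large emoticon list, the Aho–Corasick approach above is the one the benchmark favours.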

answered Oct 10 '22 05:10

Luka Rahne
