Is it possible to parse emojis using spaCy?

Is it possible to tokenize emoticons like :), :(, ;~( properly using the spaCy Python library? For example, if I run the following code:

import spacy

# 'en' is the spaCy v2 shortcut; spaCy v3 needs a package name such as 'en_core_web_sm'
nlp = spacy.load('en')
doc = nlp("Hello bright world :)")

And then visualize the doc with displaCy:

[Screenshot of the displaCy output]

It incorrectly parses world :) as one token. How can I modify spaCy so it recognizes these additional symbols? Thanks.

edit: Found the following: https://github.com/ines/spacymoji but I think it only supports Unicode emojis like ✨ and not ASCII ones like :)?

James Ko, asked Dec 18 '22 00:12


1 Answer

Yes, spaCy actually includes a pretty comprehensive list of text-based emoticons as part of its tokenizer exceptions. So using your example above and printing the individual tokens, the emoticon is tokenized correctly:

doc = nlp("Hello bright world :)")
print([token.text for token in doc])
# ['Hello', 'bright', 'world', ':)']

I think what happens here is that you actually came across an interesting (maybe non-ideal) edge case with the displacy defaults. To avoid very long dependency arcs for punctuation, the collapse_punct setting defaults to True. This means that when the visualisation is rendered, punctuation is merged onto the preceding token. Punctuation is identified by checking whether the token's is_punct attribute returns True – which also happens to be the case for ":)".
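To make the effect concrete, here is a minimal standalone sketch of that merging behaviour. It does not use spaCy or displaCy internals: the helper names `is_punct` and `collapse_punct` are hypothetical, and the punctuation check (every character in a Unicode "P" category) is only an assumption about how such a flag can be computed.

```python
import unicodedata


def is_punct(text):
    """Return True if every character is in a Unicode punctuation category."""
    return bool(text) and all(
        unicodedata.category(ch).startswith("P") for ch in text
    )


def collapse_punct(tokens):
    """Merge each punctuation-only token onto the token before it."""
    merged = []
    for tok in tokens:
        if merged and is_punct(tok):
            # Attach the punctuation to the preceding token.
            merged[-1] += " " + tok
        else:
            merged.append(tok)
    return merged


tokens = ["Hello", "bright", "world", ":)"]
print(is_punct(":)"))          # True: both ':' and ')' are punctuation characters
print(collapse_punct(tokens))  # ['Hello', 'bright', 'world :)']
```

Under these assumptions the tokenizer output stays correct, and only the rendering step glues ":)" onto "world".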

In your example, you can work around this by setting collapse_punct to False in the options passed to displacy.serve:

from spacy import displacy
displacy.serve(doc, style='dep', options={'collapse_punct': False})

(The displaCy visualizer should probably include an exception for emoticons when merging punctuation. This is currently difficult, because spaCy doesn't have an is_emoji or is_symbol flag. However, it might be a nice addition in the future – you can vote for it on this thread.)

Ines Montani, answered Dec 27 '22 02:12