Is it possible to tokenize emojis like :), :(, ;~( properly using the spaCy Python library? e.g. If I run the following code:
import spacy
nlp = spacy.load('en')
doc = nlp("Hello bright world :)")
And then visualize the doc with displaCy, it incorrectly parses "world :)" as one token. How can I modify spaCy so it recognizes these additional symbols? Thanks.
edit: Found the following: https://github.com/ines/spacymoji but I think it only supports Unicode emojis like ✨ and not ASCII ones like :)?
Yes, spaCy actually includes a pretty comprehensive list of text-based emoticons as part of its tokenizer exceptions. So using your example above and printing the individual tokens, the emoticon is tokenized correctly:
doc = nlp("Hello bright world :)")
print([token.text for token in doc])
# ['Hello', 'bright', 'world', ':)']
I think what happens here is that you actually came across an interesting (maybe non-ideal) edge case with the displacy defaults. To avoid very long dependency arcs for punctuation, the collapse_punct setting defaults to True. This means that when the visualisation is rendered, punctuation is merged onto the preceding token. Punctuation is identified by checking whether the token's is_punct attribute returns True, which also happens to be the case for ":)".
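To make the merging behaviour concrete, here is a rough pure-Python sketch of the idea. It is not displaCy's actual implementation: the helper names is_punct and collapse_punct below are hypothetical stand-ins, and the punctuation check (every character in a Unicode "P" category) is an assumption about how such a flag could be computed.

```python
import unicodedata


def is_punct(text):
    """Hypothetical check: True if every character is Unicode punctuation (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in text)


def collapse_punct(tokens):
    """Simplified stand-in for the merge step: attach punctuation-only
    tokens to the token that precedes them."""
    merged = []
    for tok in tokens:
        if merged and is_punct(tok):
            merged[-1] = merged[-1] + " " + tok
        else:
            merged.append(tok)
    return merged


tokens = ["Hello", "bright", "world", ":)"]
print([is_punct(t) for t in tokens])  # [False, False, False, True]
print(collapse_punct(tokens))         # ['Hello', 'bright', 'world :)']
```

Because ":" and ")" are both punctuation characters, the emoticon passes the check and ends up attached to "world", which matches what the visualisation shows.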
In your example, you can work around this by setting collapse_punct to False in the options passed to displacy.serve:
from spacy import displacy
displacy.serve(doc, style='dep', options={'collapse_punct': False})
(The displaCy visualizer should probably include an exception for emoticons when merging punctuation. This is currently difficult, because spaCy doesn't have an is_emoji or is_symbol flag. However, it might be a nice addition in the future, and you can vote for it on this thread.)