
Tokenizing emojis contiguous to words

Tags: python, nlp, nltk

I am trying to tokenize strings that have the two following patterns:

  • contiguous emojis, for instance "Hey, 😍🔥"
  • emojis contiguous to words, for instance "surprise💥 !!"

To do this, I have tried the word_tokenize() function from nltk. However, it does not split contiguous entities when emojis are involved.

For instance,

from nltk.tokenize import word_tokenize
word_tokenize("Hey, 😍🔥")

output: ['Hey', ',', '😍🔥']

I'd like to get: ['Hey', ',', '😍', '🔥']

and

word_tokenize("surprise💥 !!")

output: ['surprise💥', '!', '!']

I'd like to get: ['surprise', '💥', '!', '!']

Therefore, I was thinking that a specific regex pattern could solve the issue, but I don't know which pattern to use.
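As a sketch of the regex idea: a small standalone tokenizer that matches emojis first, then runs of word characters, then any other non-space character. The emoji ranges below are an assumption covering only the common emoji blocks (Miscellaneous Symbols, Dingbats, and the main supplementary emoji planes), so they may need extending:

```python
import re

# Assumed emoji ranges -- covers the common blocks only, not the full
# Unicode emoji set.
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"

# Try emojis first, then word-character runs, then single punctuation marks.
TOKEN_RE = re.compile(EMOJI + r"|\w+|[^\w\s]")

def tokenize(text):
    """Return words, punctuation, and individual emojis as separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Hey, 😍🔥"))      # ['Hey', ',', '😍', '🔥']
print(tokenize("surprise💥 !!"))  # ['surprise', '💥', '!', '!']
```

Because emojis are not Unicode word characters, `\w+` stops at the word/emoji boundary, and each emoji matches on its own.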

asked Sep 05 '25 by astiegler

1 Answer

Try using TweetTokenizer

>>> from nltk.tokenize.casual import TweetTokenizer
>>> t = TweetTokenizer()
>>> t.tokenize("Hey, 😍🔥")
['Hey', ',', '😍', '🔥']
answered Sep 07 '25 by hellpanderr