
Tokenizing emojis contiguous to words

Tags: python, nlp, nltk

I am trying to tokenize strings that have the two following patterns:

  • contiguous emojis, for instance "Hey, 😍🔥"
  • emojis contiguous to words, for instance "surprise💥 !!"

To do this, I have tried the word_tokenize() function from nltk. However, it does not split contiguous entities when emojis are involved.

For instance,

from nltk.tokenize import word_tokenize
word_tokenize("Hey, 😍🔥")

output: ['Hey', ',', '😍🔥']

I'd like to get: ['Hey', ',', '😍', '🔥']

and

word_tokenize("surprise💥 !!")

output: ['surprise💥', '!', '!']

I'd like to get: ['surprise', '💥', '!', '!']

Therefore, I was thinking that a specific regex pattern could solve the issue, but I don't know which pattern to use.
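As a sketch of the regex idea: a small standalone tokenizer that matches emojis first, then runs of word characters, then any other non-space character. The emoji ranges below are an assumption covering only the common emoji blocks (Miscellaneous Symbols, Dingbats, and the main supplementary emoji planes), so they may need extending:

```python
import re

# Assumed emoji ranges -- covers the common blocks only, not the full
# Unicode emoji set.
EMOJI = r"[\U0001F300-\U0001FAFF\u2600-\u27BF]"

# Try emojis first, then word-character runs, then single punctuation marks.
TOKEN_RE = re.compile(EMOJI + r"|\w+|[^\w\s]")

def tokenize(text):
    """Return words, punctuation, and individual emojis as separate tokens."""
    return TOKEN_RE.findall(text)

print(tokenize("Hey, 😍🔥"))      # ['Hey', ',', '😍', '🔥']
print(tokenize("surprise💥 !!"))  # ['surprise', '💥', '!', '!']
```

Because emojis are not Unicode word characters, `\w+` stops at the word/emoji boundary, and each emoji matches on its own.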

asked Sep 05 '25 by astiegler

1 Answer

Try using TweetTokenizer

>>> from nltk.tokenize.casual import TweetTokenizer
>>> t = TweetTokenizer()
>>> t.tokenize("Hey, 😍🔥")
['Hey', ',', '😍', '🔥']
answered Sep 07 '25 by hellpanderr