I am trying to tokenize strings that mix words and emojis, in the two patterns shown in the examples below.
To do this, I have tried the word_tokenize() function from nltk, but it does not split contiguous tokens when emojis are involved.
For instance,
from nltk.tokenize import word_tokenize
word_tokenize("Hey, ๐๐ฅ")
output: ['Hey', ',', '๐๐ฅ']
I'd like to get: ['Hey', ',', '๐', '๐ฅ']
and
word_tokenize("surprise๐ฅ !!")
output: ['surprise๐ฅ', '!', '!']
I'd like to get ['surprise', '๐ฅ', '!', '!']
Therefore, I was thinking that a specific regex pattern could solve the issue, but I don't know what pattern to use.
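A rough sketch of the direction I had in mind, in case it clarifies the question: pad every emoji with spaces before tokenizing. The code-point ranges below are only an approximation of the common emoji blocks (that is exactly the part I'm unsure about), and split_emojis is just a made-up helper name:
import re
from nltk.tokenize import word_tokenize

# Approximate emoji ranges (pictographs, misc symbols, dingbats).
# NOT exhaustive -- a complete solution needs the full emoji ranges.
EMOJI_RE = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF]")

def split_emojis(text):
    # Surround each emoji with spaces so word_tokenize
    # treats it as a standalone token.
    return word_tokenize(EMOJI_RE.sub(r" \g<0> ", text))

split_emojis("surprise🔥 !!")
output: ['surprise', '🔥', '!', '!']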
Try using TweetTokenizer
from nltk.tokenize.casual import TweetTokenizer
t = TweetTokenizer()
>>> t.tokenize("Hey, 😍🔥")
['Hey', ',', '😍', '🔥']
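It should also separate an emoji glued to a word, as in the second pattern:
>>> t.tokenize("surprise🔥 !!")
['surprise', '🔥', '!', '!']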