How is nltk.TweetTokenizer different from nltk.word_tokenize?

I can't work out the difference between the two. I have learned that word_tokenize follows Penn Treebank conventions for tokenization, but I couldn't find anything on TweetTokenizer. For what sort of data should I use TweetTokenizer rather than word_tokenize?

asked May 20 '20 17:05

Mehul Gupta

1 Answer

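Both tokenizers do the same basic job: they split a sentence into tokens. The difference is in what each one treats as a single token. word_tokenize follows Penn Treebank conventions, which suit ordinary prose and split most punctuation into separate tokens. TweetTokenizer is a separate, regular-expression-based tokenizer written for social media text, so it keeps hashtags, @mentions, and emoticons intact where word_tokenize breaks them apart.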
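One way to picture how a tweet-aware tokenizer keeps a hashtag in one piece is a regular expression that tries the "special" shapes before falling back to single characters. The sketch below is only an illustration under that assumption: the pattern and the function name toy_tweet_tokenize are made up for this answer and are far simpler than what NLTK actually uses.

```python
import re

# Hypothetical, simplified pattern: NOT the one NLTK ships.
# Alternatives are tried left to right, so the specific shapes win
# over the single-character fallback at the end.
TOKEN_RE = re.compile(
    r"""
    [#@]\w+            # hashtags and @mentions stay in one piece
    | [:;=]-?[)(DPp]   # a few ASCII emoticons
    | <3               # heart
    | \w+              # ordinary words
    | \S               # any other non-space character on its own
    """,
    re.VERBOSE,
)


def toy_tweet_tokenize(text):
    """Return every match of the toy pattern, in order."""
    return TOKEN_RE.findall(text)


tokens = toy_tweet_tokenize("cooool #dummysmiley: :-) <3 @remy!")
print(tokens)
```

If the first three alternatives are removed, the same input falls apart into '#', 'dummysmiley', ':', '-', ')' and so on, which is roughly what a tokenizer that treats all punctuation as separate tokens produces.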

The example below runs both tokenizers on the same tweet:

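from nltk.tokenize import TweetTokenizer, word_tokenize

# Note: word_tokenize needs NLTK's Punkt data, fetched once with nltk.download()
# (the exact resource name depends on the NLTK version).
tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))    # tweet-aware tokenization
print(word_tokenize(tweet))  # Treebank-style tokenization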

# output
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']

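Notice that word_tokenize splits #dummysmiley into '#' and 'dummysmiley', while TweetTokenizer keeps it as the single token '#dummysmiley'. The same goes for the emoticons (':-)', '<3'), the arrows ('->', '<--'), and the mention '@remy'. TweetTokenizer is built mainly for analyzing tweets, so use it for tweets and similar informal text; word_tokenize is the better fit for ordinary prose. The NLTK tokenizer documentation covers both in more detail.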
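TweetTokenizer also takes a few constructor options aimed at tweet clean-up. The short sketch below uses two of them, strip_handles and reduce_len, on the second half of the tweet above; the variable names are only for illustration, and it assumes nltk is installed (TweetTokenizer itself needs no extra data download).

```python
from nltk.tokenize import TweetTokenizer

tweet = "@remy: This is waaaaayyyy too much for you!!!!!!"

# strip_handles=True drops @mentions from the text.
# reduce_len=True shortens runs of a repeated character to three.
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)
normalized = tknzr.tokenize(tweet)
print(normalized)
```

With these options the handle '@remy' disappears and 'waaaaayyyy' is shortened to 'waaayyy', which can help when elongated spellings should count as the same word.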

answered Oct 10 '22 21:10

Darkknight


Tags: python, artificial-intelligence, tokenize, nlp, nltk
