I am unable to understand the difference between the two. Though, I come to know that word_tokenize uses Penn-Treebank for tokenization purposes. But nothing on TweetTokenizer is available. For which sort of data should I be using TweetTokenizer over word_tokenize?
word_tokenize is a function in Python that splits a given sentence into words using the NLTK library. Figure 1 below shows the tokenization of sentence into words. Figure 1: Splitting of a sentence into words. In Python, we can tokenize with the help of the Natural Language Toolkit ( NLTK ) library.
NLTK has this special method called TweetTokenizer() that helps to tokenize Tweet Corpus into relevant tokens. The advantage of using TweetTokenizer() compared to regular word_tokenize is that, when processing tweets, we often come across emojis, hashtags that need to be handled differently.
Tokenization with NLTK This is a suite of libraries and programs for statistical natural language processing for English written in Python. NLTK contains a module called tokenize with a word_tokenize() method that will help us split a text into tokens. Once you installed NLTK, write the following code to tokenize text.
NLTK removes punctuation with a significant volume of textual data; we know how difficult it can be to discover and remove extraneous words or letters.
Well, both tokenizers almost work the same way, to split a given sentence into words. But you can think of TweetTokenizer
as a subset of word_tokenize
. TweetTokenizer
keeps hashtags intact while word_tokenize
doesn't.
I hope the below example will clear all your doubts...
from nltk.tokenize import TweetTokenizer
from nltk.tokenize import word_tokenize
tt = TweetTokenizer()
tweet = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <-- @remy: This is waaaaayyyy too much for you!!!!!!"
print(tt.tokenize(tweet))
print(word_tokenize(tweet))
# output
# ['This', 'is', 'a', 'cooool', '#dummysmiley', ':', ':-)', ':-P', '<3', 'and', 'some', 'arrows', '<', '>', '->', '<--', '@remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!']
# ['This', 'is', 'a', 'cooool', '#', 'dummysmiley', ':', ':', '-', ')', ':', '-P', '<', '3', 'and', 'some', 'arrows', '<', '>', '-', '>', '<', '--', '@', 'remy', ':', 'This', 'is', 'waaaaayyyy', 'too', 'much', 'for', 'you', '!', '!', '!', '!', '!', '!']
You can see that word_tokenize
has split #dummysmiley
as '#'
and 'dummysmiley'
, while TweetTokenizer didn't, as '#dummysmiley'
. TweetTokenizer
is built mainly for analyzing tweets.
You can learn more about tokenizer from this link
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With