Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk 3.2.4, and there's nothing in the docstring of wordpunct_tokenize that explains the difference. I couldn't find this information in the nltk documentation either (perhaps I didn't search in the right place!). I would have expected the first one to get rid of punctuation tokens or the like, but it doesn't.
wordpunct_tokenize is based on a simple regexp tokenization. It is defined as

wordpunct_tokenize = WordPunctTokenizer().tokenize

which you can find here. Basically, it uses the regular expression \w+|[^\w\s]+ to split the input.
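To see this concretely, here is a minimal sketch that reproduces its behavior with Python's standard re module (the function name is made up for illustration):

import re

# Runs of word characters, or runs of characters that are neither word
# characters nor whitespace: the same pattern wordpunct_tokenize uses.
PATTERN = re.compile(r"\w+|[^\w\s]+")

def my_wordpunct_tokenize(text):
    return PATTERN.findall(text)

print(my_wordpunct_tokenize("Don't tell her!"))
# ['Don', "'", 't', 'tell', 'her', '!']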
word_tokenize, on the other hand, is based on a TreebankWordTokenizer, see the docs here. It basically tokenizes text as in the Penn Treebank. Here is a silly example that should show how the two differ.
>>> sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'",
'Hey', "',", 'she', "'", 'll', 'say', '!']
As we can see, wordpunct_tokenize splits at pretty much every special symbol and treats each one as a separate token, whereas word_tokenize keeps contractions like 're together. It doesn't seem to be all that smart, though, since as we can see it fails to separate the initial single quote from 'Hey'.
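If, as the question suggests, the goal is to get rid of punctuation tokens, neither tokenizer will do that for you, but it is easy to filter the output afterwards. A minimal sketch (tokenize_no_punct is just an illustrative name):

import string

from nltk.tokenize import word_tokenize

def tokenize_no_punct(text):
    # Drop tokens made up entirely of punctuation characters.
    # Treebank artifacts like "n't" and "'ll" contain letters, so they survive.
    return [tok for tok in word_tokenize(text)
            if not all(ch in string.punctuation for ch in tok)]

print(tokenize_no_punct("Don't tell her, you'll regret it!"))
# ['Do', "n't", 'tell', 'her', 'you', "'ll", 'regret', 'it']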
Interestingly, if we write the sentence like this instead (single quotes as the string delimiter and double quotes around "Hey"):
>>> sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'
we get
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re",
'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't",
'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''",
',', 'she', "'ll", 'say', '!']
so word_tokenize does split off the double quotes; however, it also converts them to `` and ''. wordpunct_tokenize doesn't do this:
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
"'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"',
'Hey', '",', 'she', "'", 'll', 'say', '!']