I'm exploring some of NLTK's corpora and came across the following behaviour: word_tokenize() and words() produce different sets of words.
Here is an example using webtext:
from nltk.corpus import webtext
When I run the following,
len(set(word_tokenize(webtext.raw('wine.txt'))))
I get: 3488
When I run the following,
len(set(webtext.words('wine.txt')))
I get: 3414
All I can find in the documentation is that word_tokenize() returns a list of words and punctuation, but it also says words() returns a list of words and punctuation. What's going on here? Why are they different?
I've already tried looking at the set differences.
U = set(word_tokenize(webtext.raw('wine.txt')))
V = set(webtext.words('wine.txt'))
tok_not_in_words = U.difference(V) # in tokenize but not in words
words_not_in_tok = V.difference(U) # in words but not in tokenize
All I can see is that the word_tokenize output contains hyphenated words, whereas words() splits the hyphenated words apart.
Any help is appreciated. Thank you!
word_tokenize() is an NLTK function that splits a given text into words.
Tokenization can be done at the level of words or sentences: splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization.
Whitespace tokenization is the simplest and most commonly used form of tokenization: it splits the text wherever it finds whitespace characters. It is quick and easy to understand but, because of its simplicity, it does not take special cases into account.
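As a quick illustration of that difference (a minimal sketch using a phrase from the wine reviews), whitespace splitting leaves punctuation glued to words, while word_tokenize() separates it:
>>> from nltk import word_tokenize
>>> s = "Thin and completely uninspiring."
>>> s.split()          # whitespace tokenization: the period stays attached
['Thin', 'and', 'completely', 'uninspiring.']
>>> word_tokenize(s)   # word_tokenize splits the sentence-final period off
['Thin', 'and', 'completely', 'uninspiring', '.']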
First, let's count the tokens from both approaches and look at the most common words:
>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk.corpus import webtext
>>> from collections import Counter
>>> counts_from_wordtok = Counter(word_tokenize(webtext.raw('wine.txt')))
>>> counts_from_wordtok.most_common(10)
[(u'.', 2824), (u',', 1550), (u'a', 821), (u'and', 786), (u'the', 706), (u'***', 608), (u'-', 518), (u'of', 482), (u'but', 474), (u'I', 390)]
>>> counts_from_words = Counter(webtext.words('wine.txt'))
>>> counts_from_words.most_common(10)
[(u'.', 2772), (u',', 1536), (u'-', 832), (u'a', 821), (u'and', 787), (u'the', 706), (u'***', 498), (u'of', 482), (u'but', 474), (u'I', 392)]
>>> len(word_tokenize(webtext.raw('wine.txt')))
31140
>>> len(webtext.words('wine.txt'))
31350
Let's take a closer look at how the webtext interface comes about: it uses the LazyCorpusLoader at https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L235
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
If we look at how PlaintextCorpusReader loads and tokenizes the corpus at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L41
class PlaintextCorpusReader(CorpusReader):
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):
we see that it defaults to the WordPunctTokenizer instead of the modified TreebankWordTokenizer that word_tokenize() uses.
The WordPunctTokenizer is a simplistic regex-based tokenizer found at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L171
The word_tokenize() function, on the other hand, wraps a modified TreebankWordTokenizer that is unique to NLTK: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L97
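To see how differently the two tokenizers behave, here's a small sketch using a phrase from the corpus sample further down (this is also exactly the hyphenated-word difference noted in the question):
>>> from nltk import word_tokenize
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Charming, violet-fragranced nose."
>>> WordPunctTokenizer().tokenize(s)   # regex \w+|[^\w\s]+ : hyphen and period become tokens
['Charming', ',', 'violet', '-', 'fragranced', 'nose', '.']
>>> word_tokenize(s)                   # Treebank-style: the hyphenated word stays intact
['Charming', ',', 'violet-fragranced', 'nose', '.']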
If we look at what webtext.words() is calling, we follow https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L81
def words(self, fileids=None):
    """
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
    """
    return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                   for (path, enc, fileid)
                   in self.abspaths(fileids, True, True)])
to reach _read_word_block() at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119:
def _read_word_block(self, stream):
    words = []
    for i in range(20):  # Read 20 lines at a time.
        words.extend(self._word_tokenizer.tokenize(stream.readline()))
    return words
It's reading the file line by line!
So if we take the raw webtext corpus and tokenize it with the WordPunctTokenizer, we get the same number:
>>> from nltk.corpus import webtext
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> len(wpt.tokenize(webtext.raw('wine.txt')))
31350
>>> len(webtext.words('wine.txt'))
31350
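Reading line by line doesn't change anything for the WordPunctTokenizer, since it's a pure regex tokenizer and none of its tokens can span a newline. A quick sanity check (a sketch; splitlines() stands in for the stream.readline() loop above):
>>> wpt_line_tokens = []
>>> for line in webtext.raw('wine.txt').splitlines(True):
...     wpt_line_tokens.extend(wpt.tokenize(line))
...
>>> len(wpt_line_tokens)
31350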
You can also create a new webtext corpus object by specifying a different tokenizer object, e.g.
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> from nltk.corpus import LazyCorpusLoader, PlaintextCorpusReader
>>> from nltk.corpus import webtext
# LazyCorpusLoader expects a tokenizer object,
# but word_tokenize() is a function, so we have to
# import the tokenizer object that word_tokenize wraps around
>>> webtext2 = LazyCorpusLoader('webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
>>> len(webtext2.words('wine.txt'))
28385
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> list(webtext2.words('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine.', u'Polished', u'leather', u'and', u'strawberries.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now.', u'***', u'Liquorice', u',', u'cherry', u'fruit.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish.', u'**', u'Thin', u'and', u'completely', u'uninspiring.', u'*', u'Rough.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great', u'complexity.', u'A', u'good', u'***', u'Charming', u',', u'violet-fragranced', u'nose.']
>>> word_tokenize(webtext2.raw('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.', u'Polished', u'leather', u'and', u'strawberries', u'.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now', u'.', u'***', u'Liquorice', u',', u'cherry', u'fruit', u'.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish', u'.', u'**', u'Thin', u'and', u'completely', u'uninspiring', u'.', u'*', u'Rough', u'.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch', u'.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great']
That's because word_tokenize() does a sent_tokenize() before actually tokenizing the sentences into words: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L113
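In other words, word_tokenize(text) is essentially sentence tokenization followed by the (modified) Treebank tokenizer on each sentence. A quick sketch of that equivalence, reusing the _treebank_word_tokenizer object imported above:
>>> from nltk import sent_tokenize, word_tokenize
>>> raw = webtext2.raw('wine.txt')
>>> two_step = [tok for sent in sent_tokenize(raw)
...             for tok in _treebank_word_tokenizer.tokenize(sent)]
>>> len(two_step)
31140
>>> two_step == word_tokenize(raw)
True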
But PlaintextCorpusReader._read_word_block() doesn't do sent_tokenize() beforehand: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119
Let's do a recount with sentence tokenization first:
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> sum(len(tokenized_sent) for tokenized_sent in webtext2.sents('wine.txt'))
31140
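And conversely, the 28385 from webtext2.words() is what you get by running the Treebank tokenizer line by line without sentence-splitting, mimicking _read_word_block() (a rough sketch; splitlines() stands in for the stream.readline() loop):
>>> tb_line_tokens = []
>>> for line in webtext2.raw('wine.txt').splitlines():
...     tb_line_tokens.extend(_treebank_word_tokenizer.tokenize(line))
...
>>> len(tb_line_tokens)
28385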
Note: the sent_tokenizer of PlaintextCorpusReader is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), which is the same Punkt model that the nltk.sent_tokenize() function uses.
Why doesn't words() do sentence tokenization first? I think it's because it was originally built around the WordPunctTokenizer, which doesn't need the string to be sentence-tokenized first, whereas the TreebankWordTokenizer expects its input to be sentence-tokenized first.
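To see that dependence concretely, here's a small chunk adapted from the corpus sample above: the Treebank tokenizer only detaches a period at the very end of the string it is given, which is exactly why the 'wine.' and 'strawberries.' tokens above keep their periods, while the WordPunctTokenizer splits every period regardless.
>>> from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer
>>> chunk = "Polished leather and strawberries. Perhaps a bit dilute."
>>> TreebankWordTokenizer().tokenize(chunk)  # only the string-final period is detached
['Polished', 'leather', 'and', 'strawberries.', 'Perhaps', 'a', 'bit', 'dilute', '.']
>>> WordPunctTokenizer().tokenize(chunk)     # every period is split off
['Polished', 'leather', 'and', 'strawberries', '.', 'Perhaps', 'a', 'bit', 'dilute', '.']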
Why is it that in the age of "deep learning" and "machine learning" we are still using regex-based tokenizers, and everything else in NLP is largely built on top of these tokens?
I have no idea... But there are alternatives, e.g. http://gmb.let.rug.nl/elephant/about.php