I'm exploring some of NLTK's corpora and came across the following behaviour: word_tokenize() and words() produce different sets of words.
Here is an example using webtext:
from nltk.corpus import webtext
When I run the following,
len(set(word_tokenize(webtext.raw('wine.txt'))))
I get: 3488
When I run the following,
len(set(webtext.words('wine.txt')))
I get: 3414
All I can find in the documentation is that word_tokenize() returns a list of words and punctuation, but it also says words() returns a list of words and punctuation. What's going on here? Why are they different?
I've already tried looking at the set differences.
U = set(word_tokenize(webtext.raw('wine.txt')))
V = set(webtext.words('wine.txt'))
tok_not_in_words = U.difference(V) # in tokenize but not in words
words_not_in_tok = V.difference(U) # in words but not in tokenize
All I can see is that the word_tokenize output contains hyphenated words, whereas words() splits the hyphenated words apart.
Any help is appreciated. Thank you!
word_tokenize() is an NLTK function that splits a given text into words.
Tokenization can be done at the level of words or sentences: splitting text into words is called word tokenization, and splitting it into sentences is called sentence tokenization.
Whitespace tokenization is the simplest and most commonly used form of tokenization: it splits the text wherever it finds whitespace characters. It is quick and easy to understand but, because of its simplicity, it does not take special cases into account.
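As a quick illustration of that difference (a minimal sketch using a phrase from the wine reviews), whitespace splitting leaves punctuation glued to words, while word_tokenize() separates it:
>>> from nltk import word_tokenize
>>> s = "Thin and completely uninspiring."
>>> s.split()          # whitespace tokenization: the period stays attached
['Thin', 'and', 'completely', 'uninspiring.']
>>> word_tokenize(s)   # word_tokenize splits the sentence-final period off
['Thin', 'and', 'completely', 'uninspiring', '.']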
First, let's count the tokens from both approaches and look at the most common words:
>>> import nltk
>>> from nltk import word_tokenize
>>> from nltk.corpus import webtext
>>> from collections import Counter
>>> counts_from_wordtok = Counter(word_tokenize(webtext.raw('wine.txt')))
>>> counts_from_wordtok.most_common(10)
[(u'.', 2824), (u',', 1550), (u'a', 821), (u'and', 786), (u'the', 706), (u'***', 608), (u'-', 518), (u'of', 482), (u'but', 474), (u'I', 390)]
>>> counts_from_words = Counter(webtext.words('wine.txt'))
>>> counts_from_words.most_common(10)
[(u'.', 2772), (u',', 1536), (u'-', 832), (u'a', 821), (u'and', 787), (u'the', 706), (u'***', 498), (u'of', 482), (u'but', 474), (u'I', 392)]
>>> len(word_tokenize(webtext.raw('wine.txt')))
31140
>>> len(webtext.words('wine.txt'))
31350
Let's take a closer look at how the webtext interface comes about: it uses the LazyCorpusLoader at https://github.com/nltk/nltk/blob/develop/nltk/corpus/__init__.py#L235
webtext = LazyCorpusLoader(
    'webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2')
If we look at how PlaintextCorpusReader loads and tokenizes the corpus at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L41
class PlaintextCorpusReader(CorpusReader):
    CorpusView = StreamBackedCorpusView

    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding='utf8'):
we see that it defaults to the WordPunctTokenizer instead of the modified TreebankWordTokenizer that word_tokenize() uses.
The WordPunctTokenizer is a simplistic regex-based tokenizer found at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/regexp.py#L171
The word_tokenize() function, on the other hand, wraps a modified TreebankWordTokenizer that is unique to NLTK: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L97
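To see how differently the two tokenizers behave, here's a small sketch using a phrase from the corpus sample further down (this is also exactly the hyphenated-word difference noted in the question):
>>> from nltk import word_tokenize
>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Charming, violet-fragranced nose."
>>> WordPunctTokenizer().tokenize(s)   # regex \w+|[^\w\s]+ : hyphen and period become tokens
['Charming', ',', 'violet', '-', 'fragranced', 'nose', '.']
>>> word_tokenize(s)                   # Treebank-style: the hyphenated word stays intact
['Charming', ',', 'violet-fragranced', 'nose', '.']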
If we look at what webtext.words() is calling, we follow https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L81
def words(self, fileids=None):
    """
    :return: the given file(s) as a list of words
        and punctuation symbols.
    :rtype: list(str)
    """
    return concat([self.CorpusView(path, self._read_word_block, encoding=enc)
                   for (path, enc, fileid)
                   in self.abspaths(fileids, True, True)])
to reach _read_word_block() at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119:
def _read_word_block(self, stream):
    words = []
    for i in range(20):  # Read 20 lines at a time.
        words.extend(self._word_tokenizer.tokenize(stream.readline()))
    return words
It's reading the file line by line!
So if we take the raw webtext corpus and tokenize it with the WordPunctTokenizer, we get the same number:
>>> from nltk.corpus import webtext
>>> from nltk.tokenize import WordPunctTokenizer
>>> wpt = WordPunctTokenizer()
>>> len(wpt.tokenize(webtext.raw('wine.txt')))
31350
>>> len(webtext.words('wine.txt'))
31350
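Reading line by line doesn't change anything for the WordPunctTokenizer, since it's a pure regex tokenizer and none of its tokens can span a newline. A quick sanity check (a sketch; splitlines() stands in for the stream.readline() loop above):
>>> wpt_line_tokens = []
>>> for line in webtext.raw('wine.txt').splitlines(True):
...     wpt_line_tokens.extend(wpt.tokenize(line))
...
>>> len(wpt_line_tokens)
31350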
You can also create a new webtext corpus object by specifying a different tokenizer object, e.g.
>>> from nltk.tokenize import _treebank_word_tokenizer
>>> from nltk.corpus import LazyCorpusLoader, PlaintextCorpusReader
>>> from nltk.corpus import webtext
# LazyCorpusLoader expects a tokenizer object,
# but word_tokenize() is a function, so we have to
# import the tokenizer object that word_tokenize wraps around
>>> webtext2 = LazyCorpusLoader('webtext', PlaintextCorpusReader, r'(?!README|\.).*\.txt', encoding='ISO-8859-2', word_tokenizer=_treebank_word_tokenizer)
>>> len(webtext2.words('wine.txt'))
28385
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> list(webtext2.words('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine.', u'Polished', u'leather', u'and', u'strawberries.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now.', u'***', u'Liquorice', u',', u'cherry', u'fruit.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish.', u'**', u'Thin', u'and', u'completely', u'uninspiring.', u'*', u'Rough.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great', u'complexity.', u'A', u'good', u'***', u'Charming', u',', u'violet-fragranced', u'nose.']
>>> word_tokenize(webtext2.raw('wine.txt'))[:100]
[u'Lovely', u'delicate', u',', u'fragrant', u'Rhone', u'wine', u'.', u'Polished', u'leather', u'and', u'strawberries', u'.', u'Perhaps', u'a', u'bit', u'dilute', u',', u'but', u'good', u'for', u'drinking', u'now', u'.', u'***', u'Liquorice', u',', u'cherry', u'fruit', u'.', u'Simple', u'and', u'coarse', u'at', u'the', u'finish', u'.', u'**', u'Thin', u'and', u'completely', u'uninspiring', u'.', u'*', u'Rough', u'.', u'No', u'Stars', u'Big', u',', u'fat', u',', u'textured', u'Chardonnay', u'-', u'nuts', u'and', u'butterscotch', u'.', u'A', u'slightly', u'odd', u'metallic/cardboard', u'finish', u',', u'but', u'probably', u'***', u'A', u'blind', u'tasting', u',', u'other', u'than', u'the', u'fizz', u',', u'which', u'included', u'five', u'vintages', u'of', u'Cote', u'Rotie', u'Brune', u'et', u'Blonde', u'from', u'Guigal', u'.', u'Surprisingly', u'young', u'feeling', u'and', u'drinking', u'well', u',', u'but', u'without', u'any', u'great']
That's because word_tokenize() does a sent_tokenize() before actually tokenizing the sentences into words: https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L113
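In other words, word_tokenize(text) is essentially sentence tokenization followed by the (modified) Treebank tokenizer on each sentence. A quick sketch of that equivalence, reusing the _treebank_word_tokenizer object imported above:
>>> from nltk import sent_tokenize, word_tokenize
>>> raw = webtext2.raw('wine.txt')
>>> two_step = [tok for sent in sent_tokenize(raw)
...             for tok in _treebank_word_tokenizer.tokenize(sent)]
>>> len(two_step)
31140
>>> two_step == word_tokenize(raw)
True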
But PlaintextCorpusReader._read_word_block() doesn't do sent_tokenize() beforehand: https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/plaintext.py#L119
Let's do a recount with sentence tokenization first:
>>> len(word_tokenize(webtext2.raw('wine.txt')))
31140
>>> sum(len(tokenized_sent) for tokenized_sent in webtext2.sents('wine.txt'))
31140
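And conversely, the 28385 from webtext2.words() is what you get by running the Treebank tokenizer line by line without sentence-splitting, mimicking _read_word_block() (a rough sketch; splitlines() stands in for the stream.readline() loop):
>>> tb_line_tokens = []
>>> for line in webtext2.raw('wine.txt').splitlines():
...     tb_line_tokens.extend(_treebank_word_tokenizer.tokenize(line))
...
>>> len(tb_line_tokens)
28385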
Note: the sent_tokenizer of PlaintextCorpusReader is nltk.data.LazyLoader('tokenizers/punkt/english.pickle'), which is the same Punkt model that the nltk.sent_tokenize() function uses.
Why doesn't words() do sentence tokenization first? I think it's because it was originally built around the WordPunctTokenizer, which doesn't need the string to be sentence-tokenized first, whereas the TreebankWordTokenizer expects its input to be sentence-tokenized first.
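To see that dependence concretely, here's a small chunk adapted from the corpus sample above: the Treebank tokenizer only detaches a period at the very end of the string it is given, which is exactly why the 'wine.' and 'strawberries.' tokens above keep their periods, while the WordPunctTokenizer splits every period regardless.
>>> from nltk.tokenize import TreebankWordTokenizer, WordPunctTokenizer
>>> chunk = "Polished leather and strawberries. Perhaps a bit dilute."
>>> TreebankWordTokenizer().tokenize(chunk)  # only the string-final period is detached
['Polished', 'leather', 'and', 'strawberries.', 'Perhaps', 'a', 'bit', 'dilute', '.']
>>> WordPunctTokenizer().tokenize(chunk)     # every period is split off
['Polished', 'leather', 'and', 'strawberries', '.', 'Perhaps', 'a', 'bit', 'dilute', '.']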
Why is it that in the age of "deep learning" and "machine learning" we are still using regex-based tokenizers, and everything else in NLP is largely built on top of these tokens?
I have no idea... But there are alternatives, e.g. http://gmb.let.rug.nl/elephant/about.php