I am going to use nltk.tokenize.word_tokenize on a cluster where my account is tightly limited by a space quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, they take ~2.5 GB. That seems like overkill to me. Could you suggest what the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize are? So far I've seen nltk.download('punkt'), but I'm not sure whether it is sufficient or how big it is. What exactly should I run in order to make it work?
See the nltk.tokenize.punkt module. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
The argument to nltk.download() is not a file or module, but a resource id that maps to a corpus, machine-learning model, or other resource (or collection of resources) to be installed in your NLTK_DATA area. You can see a list of the available resources, and their ids, at http://www.nltk.org/nltk_data/ .
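For example, a quick sketch (both 'punkt' and 'brown' are valid resource ids; the comments are just for illustration):

import nltk

# Download a single resource (or collection) by its id into an
# nltk_data directory on nltk.data.path:
nltk.download('punkt')   # Punkt sentence tokenizer models
nltk.download('brown')   # Brown Corpus, as another example of a resource id

# Calling nltk.download() with no argument opens the interactive
# downloader; the ~2.5 GB figure comes from installing the whole
# "all" collection there.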
Alternatively, download individual packages from https://www.nltk.org/nltk_data/ (see the "download" links) and unzip them to the appropriate subfolder. For example, the Brown Corpus, found at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip, should be unzipped to nltk_data/corpora/brown .
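If you prefer the manual route in Python, something along these lines should work (the punkt.zip URL is inferred by analogy with the brown.zip URL above, so double-check it):

import os
import urllib.request
import zipfile

# Any directory on nltk.data.path works; ~/nltk_data is the usual default.
nltk_data = os.path.expanduser('~/nltk_data')
target = os.path.join(nltk_data, 'tokenizers')
os.makedirs(target, exist_ok=True)

# punkt is a tokenizer package, so it goes under tokenizers/ rather than corpora/.
url = ('https://raw.githubusercontent.com/nltk/nltk_data/'
       'gh-pages/packages/tokenizers/punkt.zip')
zip_path = os.path.join(target, 'punkt.zip')
urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target)   # ends up as nltk_data/tokenizers/punkt/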
You are right, you need the Punkt Tokenizer Models. They take about 13 MB, and nltk.download('punkt') should do the trick.
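On a quota-limited account you can also point the download at a specific directory and tell NLTK where to look at runtime; a minimal sketch (the path is just an example):

import nltk

nltk_data = '/home/me/nltk_data'                # illustrative path on the cluster
nltk.download('punkt', download_dir=nltk_data)  # ~13 MB instead of ~2.5 GB
nltk.data.path.append(nltk_data)                # make NLTK look there at runtime

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']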
In short:
nltk.download('punkt')
would suffice.
In long:
You don't necessarily need to download all the models and corpora available in NLTK if you're only going to use it for tokenization.
Actually, if you're just using word_tokenize(), then you shouldn't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']
But:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
But it looks like that's not the case, if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
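Roughly, the wrapper in nltk/tokenize/__init__.py looks like the following sketch (reconstructed from the traceback above, not copied verbatim):

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def word_tokenize(text, language='english'):
    # sent_tokenize() loads tokenizers/punkt/{language}.pickle, which is
    # why plain word tokenization now pulls in the punkt model.
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]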
I am not sure whether this is a bug or a feature, but it seems the old idiom might be outdated given the current code:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]
It can simply be:
>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.
Alternatively, you can try the newer tokenizer that was added to NLTK, toktok.py, based on https://github.com/jonsafari/tok-tok , which requires no pre-trained models.
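A quick sketch with the Toktok tokenizer, which ships with NLTK and needs nothing in nltk_data:

from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
print(toktok.tokenize('This is a foo bar sentence.'))
# Expected output, roughly: ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']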
If you bundle huge NLTK pickles into an AWS Lambda deployment package, the inline code editor won't be available. Use Lambda layers instead: upload the NLTK data as a layer and point NLTK at it in your code, like below.
nltk.data.path.append("/opt/tmp_nltk")
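For context, a minimal handler sketch, assuming the punkt data was packaged into a layer that appears under /opt/tmp_nltk (the path from the line above):

import nltk

nltk.data.path.append("/opt/tmp_nltk")  # layer contents are mounted under /opt

from nltk.tokenize import word_tokenize

def lambda_handler(event, context):
    text = event.get("text", "")
    return {"tokens": word_tokenize(text)}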