 

What to download in order to make nltk.tokenize.word_tokenize work?

Tags: python, nltk

I am going to use nltk.tokenize.word_tokenize on a cluster where my account is very limited by space quota. At home, I downloaded all nltk resources by nltk.download() but, as I found out, it takes ~2.5GB.

This seems a bit overkill to me. Could you suggest what are the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize? So far, I've seen nltk.download('punkt') but I am not sure whether it is sufficient and what is the size. What exactly should I run in order to make it work?

Asked May 08 '16 by petrbel


People also ask

What does nltk.download('punkt') do?

It downloads the pre-trained models used by the nltk.tokenize.punkt module. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
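For example, a minimal sketch (assuming network access on the machine doing the download): fetch only the Punkt models and use them for sentence splitting.

import nltk

nltk.download('punkt')  # fetches tokenizers/punkt into your NLTK data directory

from nltk.tokenize import sent_tokenize
print(sent_tokenize('This is a sentence. This is another.'))
# ['This is a sentence.', 'This is another.']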

What is nltk.download('wordnet')?

The argument to nltk.download() is not a file or module, but a resource id that maps to a corpus, machine-learning model, or other resource (or collection of resources) to be installed in your NLTK_DATA area. You can see a list of the available resources, and their IDs, at http://www.nltk.org/nltk_data/ .
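As a sketch, the resource id is the only required argument; a custom download_dir (the path below is just an illustration) can keep the data inside a quota-limited home directory.

import nltk

# Download two specific resources by id instead of the full ~2.5 GB collection.
nltk.download('punkt', download_dir='/home/me/nltk_data')
nltk.download('wordnet', download_dir='/home/me/nltk_data')

# Make sure NLTK searches that directory at runtime.
nltk.data.path.append('/home/me/nltk_data')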

How do I download NLTK packages manually?

Download individual packages from https://www.nltk.org/nltk_data/ (see the “download” links) and unzip them into the appropriate subfolder. For example, the Brown Corpus, found at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip, should be unzipped to nltk_data/corpora/brown.
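The same manual route can be scripted in Python; the punkt URL below is an assumption based on the package layout described above.

import os
import urllib.request
import zipfile

url = 'https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/tokenizers/punkt.zip'
target = os.path.expanduser('~/nltk_data/tokenizers')
os.makedirs(target, exist_ok=True)

zip_path = os.path.join(target, 'punkt.zip')
urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target)  # creates ~/nltk_data/tokenizers/punkt/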


3 Answers

You are right. You need the Punkt Tokenizer Models. The download is about 13 MB, and nltk.download('punkt') should do the trick.
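A minimal sketch of that setup: download only the Punkt models, then word_tokenize works.

import nltk

nltk.download('punkt')  # ~13 MB instead of the full ~2.5 GB collection

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']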

Answered Oct 22 '22 by Tulio Casagrande


In short:

nltk.download('punkt')

would suffice.


In long:

You don't necessarily need to download all the models and corpora available in NLTK if you're just going to use it for tokenization.

Actually, if you're just using word_tokenize(), it looks at first as if you won't need any of the resources from nltk.download(): the default word_tokenize() is basically the TreebankWordTokenizer, which shouldn't need any additional resources:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']

But:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29) 
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
    opened_resource = _open(resource_url)
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
    return find(path_, path + ['']).open()
  File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - u''
**********************************************************************

But it looks like that's not the case: if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93, word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
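Roughly, word_tokenize() at that point behaved like the following sketch (not the exact NLTK source): split into sentences with the Punkt model first, then run the Treebank tokenizer on each sentence, which is why the punkt download is needed.

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank = TreebankWordTokenizer()

def word_tokenize_sketch(text, language='english'):
    # sent_tokenize() loads tokenizers/punkt/<language>.pickle under the hood
    return [token
            for sent in sent_tokenize(text, language)
            for token in _treebank.tokenize(sent)]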

I am not sure whether this is a bug or a feature but it seems like the old idiom might be outdated given the current code:

>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]

It can simply be:

>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

But note that word_tokenize() flattens the list of lists of strings into a single list of strings, so the sentence boundaries are lost.


Alternatively, you can try a newer tokenizer that was added to NLTK, toktok.py (based on https://github.com/jonsafari/tok-tok), which requires no pre-trained models.
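A short sketch of that option (ToktokTokenizer is the class name in nltk.tokenize; no nltk.download() call is needed because there are no pickled models to load):

from nltk.tokenize import ToktokTokenizer

toktok = ToktokTokenizer()
print(toktok.tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']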

Answered Oct 22 '22 by alvas


If you bundle large NLTK pickles into an AWS Lambda deployment package, the package becomes too big for the inline code editor. Use Lambda layers instead: upload the NLTK data as a layer and point NLTK at it in your code, like below.

import nltk

nltk.data.path.append("/opt/tmp_nltk")
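A hedged sketch of how this might look inside a Lambda handler; /opt is where layer contents are mounted, and the tmp_nltk folder name, handler name, and event shape are just illustrations.

import nltk

# NLTK data shipped in a Lambda layer ends up under /opt
nltk.data.path.append('/opt/tmp_nltk')

from nltk.tokenize import word_tokenize

def lambda_handler(event, context):
    text = event.get('text', '')
    return {'tokens': word_tokenize(text)}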
Answered Oct 22 '22 by Hari krish