I am going to use nltk.tokenize.word_tokenize on a cluster where my account is tightly limited by a space quota. At home, I downloaded all NLTK resources with nltk.download(), but, as I found out, they take ~2.5 GB. That seems like overkill to me. Could you suggest what the minimal (or almost minimal) dependencies for nltk.tokenize.word_tokenize are? So far I've seen nltk.download('punkt'), but I'm not sure whether it is sufficient or how big it is. What exactly should I run in order to make it work?
See the nltk.tokenize.punkt module. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
The argument to nltk.download() is not a file or module, but a resource id that maps to a corpus, machine-learning model, or other resource (or collection of resources) to be installed in your NLTK_DATA area. You can see a list of the available resources, and their ids, at http://www.nltk.org/nltk_data/ .
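For example, a quick sketch (both 'punkt' and 'brown' are valid resource ids; the comments are just for illustration):

import nltk

# Download a single resource (or collection) by its id into an
# nltk_data directory on nltk.data.path:
nltk.download('punkt')   # Punkt sentence tokenizer models
nltk.download('brown')   # Brown Corpus, as another example of a resource id

# Calling nltk.download() with no argument opens the interactive
# downloader; the ~2.5 GB figure comes from installing the whole
# "all" collection there.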
Alternatively, download individual packages from https://www.nltk.org/nltk_data/ (see the "download" links) and unzip them to the appropriate subfolder. For example, the Brown Corpus, found at https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/brown.zip, should be unzipped to nltk_data/corpora/brown .
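If you prefer the manual route in Python, something along these lines should work (the punkt.zip URL is inferred by analogy with the brown.zip URL above, so double-check it):

import os
import urllib.request
import zipfile

# Any directory on nltk.data.path works; ~/nltk_data is the usual default.
nltk_data = os.path.expanduser('~/nltk_data')
target = os.path.join(nltk_data, 'tokenizers')
os.makedirs(target, exist_ok=True)

# punkt is a tokenizer package, so it goes under tokenizers/ rather than corpora/.
url = ('https://raw.githubusercontent.com/nltk/nltk_data/'
       'gh-pages/packages/tokenizers/punkt.zip')
zip_path = os.path.join(target, 'punkt.zip')
urllib.request.urlretrieve(url, zip_path)

with zipfile.ZipFile(zip_path) as zf:
    zf.extractall(target)   # ends up as nltk_data/tokenizers/punkt/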
You are right, you need the Punkt Tokenizer Models. They take about 13 MB, and nltk.download('punkt') should do the trick.
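On a quota-limited account you can also point the download at a specific directory and tell NLTK where to look at runtime; a minimal sketch (the path is just an example):

import nltk

nltk_data = '/home/me/nltk_data'                # illustrative path on the cluster
nltk.download('punkt', download_dir=nltk_data)  # ~13 MB instead of ~2.5 GB
nltk.data.path.append(nltk_data)                # make NLTK look there at runtime

from nltk.tokenize import word_tokenize
print(word_tokenize('This is a sentence.'))
# ['This', 'is', 'a', 'sentence', '.']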
In short:
nltk.download('punkt')
would suffice.
In long:
You don't necessarily need to download all the models and corpora available in NLTK if you're only going to use it for tokenization.
Actually, if you're just using word_tokenize(), then you shouldn't really need any of the resources from nltk.download(). If we look at the code, the default word_tokenize(), which is basically the TreebankWordTokenizer, shouldn't use any additional resources:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data/
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import word_tokenize
>>> from nltk.tokenize import TreebankWordTokenizer
>>> tokenizer = TreebankWordTokenizer()
>>> tokenizer.tokenize('This is a sentence.')
['This', 'is', 'a', 'sentence', '.']
But:
alvas@ubi:~$ ls nltk_data/
chunkers corpora grammars help models stemmers taggers tokenizers
alvas@ubi:~$ mv nltk_data/ tmp_move_nltk_data
alvas@ubi:~$ python
Python 2.7.11+ (default, Apr 17 2016, 14:00:29)
[GCC 5.3.1 20160413] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from nltk import sent_tokenize
>>> sent_tokenize('This is a sentence. This is another.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
>>> from nltk import word_tokenize
>>> word_tokenize('This is a sentence.')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 106, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/usr/local/lib/python2.7/dist-packages/nltk/tokenize/__init__.py", line 90, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 801, in load
opened_resource = _open(resource_url)
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 919, in _open
return find(path_, path + ['']).open()
File "/usr/local/lib/python2.7/dist-packages/nltk/data.py", line 641, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/alvas/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''
**********************************************************************
But it looks like that's not the case, if we look at https://github.com/nltk/nltk/blob/develop/nltk/tokenize/__init__.py#L93. It seems that word_tokenize() implicitly calls sent_tokenize(), which requires the punkt model.
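Roughly, the wrapper in nltk/tokenize/__init__.py looks like the following sketch (reconstructed from the traceback above, not copied verbatim):

from nltk.tokenize import sent_tokenize, TreebankWordTokenizer

_treebank_word_tokenize = TreebankWordTokenizer().tokenize

def word_tokenize(text, language='english'):
    # sent_tokenize() loads tokenizers/punkt/{language}.pickle, which is
    # why plain word tokenization now pulls in the punkt model.
    return [token for sent in sent_tokenize(text, language)
            for token in _treebank_word_tokenize(sent)]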
I am not sure whether this is a bug or a feature, but it seems the old idiom might be outdated given the current code:
>>> from nltk import sent_tokenize, word_tokenize
>>> sentences = 'This is a foo bar sentence. This is another sentence.'
>>> tokenized_sents = [word_tokenize(sent) for sent in sent_tokenize(sentences)]
>>> tokenized_sents
[['This', 'is', 'a', 'foo', 'bar', 'sentence', '.'], ['This', 'is', 'another', 'sentence', '.']]
It can simply be:
>>> word_tokenize(sentences)
['This', 'is', 'a', 'foo', 'bar', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
But we see that word_tokenize() flattens the list of lists of strings into a single list of strings.
Alternatively, you can try the newer tokenizer that was added to NLTK, toktok.py, based on https://github.com/jonsafari/tok-tok , which requires no pre-trained models.
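A quick sketch with the Toktok tokenizer, which ships with NLTK and needs nothing in nltk_data:

from nltk.tokenize.toktok import ToktokTokenizer

toktok = ToktokTokenizer()
print(toktok.tokenize('This is a foo bar sentence.'))
# Expected output, roughly: ['This', 'is', 'a', 'foo', 'bar', 'sentence', '.']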
If you bundle huge NLTK pickles into an AWS Lambda deployment package, the inline code editor won't be available. Use Lambda layers instead: upload the NLTK data as a layer and point NLTK at it in your code, like below.
nltk.data.path.append("/opt/tmp_nltk")
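For context, a minimal handler sketch, assuming the punkt data was packaged into a layer that appears under /opt/tmp_nltk (the path from the line above):

import nltk

nltk.data.path.append("/opt/tmp_nltk")  # layer contents are mounted under /opt

from nltk.tokenize import word_tokenize

def lambda_handler(event, context):
    text = event.get("text", "")
    return {"tokens": word_tokenize(text)}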