How do I download NLTK data?




Updated answer: NLTK works well with Python 2.7. I had Python 3.2; I uninstalled 3.2 and installed 2.7, and now it works.

I have installed NLTK and tried to download the NLTK data. I followed the instructions on this site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

    >>> import nltk
    >>> nltk.download()

It gave me the following error message:

    Traceback (most recent call last):
      File "<pyshell#6>", line 1, in <module>
        nltk.download()
    AttributeError: 'module' object has no attribute 'download'

    Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I used help(nltk) to inspect the package, which shows the following info:

    NAME
        nltk

    PACKAGE CONTENTS
        align
        app (package)
        book
        ccg (package)
        chat (package)
        chunk (package)
        classify (package)
        cluster (package)
        collocations
        corpus (package)
        data
        decorators
        downloader
        draw (package)
        examples (package)
        featstruct
        grammar
        help
        inference (package)
        internals
        lazyimport
        metrics (package)
        misc (package)
        model (package)
        parse (package)
        probability
        sem (package)
        sourcedstring
        stem (package)
        tag (package)
        test (package)
        text
        tokenize (package)
        toolbox
        tree
        treetransforms
        util
        yamltags

    FILE
        c:\python32\lib\site-packages\nltk

I do see downloader listed there, so I am not sure why it does not work. I am using Python 3.2.2 on Windows Vista.

1 Answer


To download a particular dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer, use:

    $ python3
    >>> import nltk
    >>> nltk.download('punkt')
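If the call fails with an AttributeError like the one in the question, it can help to check which module Python actually imported. Possible causes include a local file named nltk.py shadowing the package, or an NLTK release that predates Python 3 support. The sketch below is only a diagnostic aid; the helper name `describe_module` is hypothetical and not part of NLTK.

```python
import importlib


def describe_module(name):
    """Report where a module was imported from and whether it exposes download()."""
    module = importlib.import_module(name)
    return {
        # Path of the file that was imported; a path outside site-packages
        # suggests that a local file is shadowing the real package.
        "file": getattr(module, "__file__", None),
        # Version string, if the module defines one.
        "version": getattr(module, "__version__", None),
        # True if the module has a top-level attribute named "download".
        "has_download": hasattr(module, "download"),
    }
```

Calling `describe_module("nltk")` in the same interpreter that raised the error shows the import path and the installed version, which is usually enough to tell a shadowed or outdated install from a working one.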

If you're unsure of which data/model you need, you can start out with the basic list of data + models with:

    >>> import nltk
    >>> nltk.download('popular')

This downloads a collection of "popular" resources, which includes:

    <collection id="popular" name="Popular packages">
      <item ref="cmudict" />
      <item ref="gazetteers" />
      <item ref="genesis" />
      <item ref="gutenberg" />
      <item ref="inaugural" />
      <item ref="movie_reviews" />
      <item ref="names" />
      <item ref="shakespeare" />
      <item ref="stopwords" />
      <item ref="treebank" />
      <item ref="twitter_samples" />
      <item ref="omw" />
      <item ref="wordnet" />
      <item ref="wordnet_ic" />
      <item ref="words" />
      <item ref="maxent_ne_chunker" />
      <item ref="punkt" />
      <item ref="snowball_data" />
      <item ref="averaged_perceptron_tagger" />
    </collection>
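If you only want a few of these packages rather than the whole collection, the ids can be pulled out of the XML with the standard library. This is a minimal sketch, assuming the collection XML is available as a string; `POPULAR_XML` is a shortened copy of the listing above and `collection_item_ids` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the "popular" collection (only three items kept).
POPULAR_XML = """
<collection id="popular" name="Popular packages">
  <item ref="stopwords" />
  <item ref="wordnet" />
  <item ref="punkt" />
</collection>
"""


def collection_item_ids(xml_text):
    """Return the package ids referenced by the <item> elements of a collection."""
    root = ET.fromstring(xml_text)
    return [item.get("ref") for item in root.iter("item")]


print(collection_item_ids(POPULAR_XML))  # ['stopwords', 'wordnet', 'punkt']
```

Each id returned this way can be passed to nltk.download() on its own, in the same way as 'punkt' earlier in this answer.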


In case anyone needs to avoid errors when downloading larger datasets from NLTK, here is a workaround from https://stackoverflow.com/a/38135306/610569

    $ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
    $ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
    $ python

    >>> import nltk
    >>> dler = nltk.downloader.Downloader()
    >>> dler._update_index()
    >>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it were already installed.
    >>> dler.download('popular')


From v3.2.5, NLTK gives a more informative error message when an nltk_data resource is not found, e.g.:

    >>> from nltk import word_tokenize
    >>> word_tokenize('x')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
        sentences = [text] if preserve_line else sent_tokenize(text, language)
      File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
        tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
      File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
        opened_resource = _open(resource_url)
      File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
        return find(path_, path + ['']).open()
      File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
        raise LookupError(resource_not_found)
    LookupError:
    **********************************************************************
      Resource punkt not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('punkt')

      Searched in:
        - '/Users/alvas/nltk_data'
        - '/usr/share/nltk_data'
        - '/usr/local/share/nltk_data'
        - '/usr/lib/nltk_data'
        - '/usr/local/lib/nltk_data'
        - ''
    **********************************************************************
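Since a missing resource surfaces as a LookupError, a script can check for the resource up front and download it only when needed. The following is a minimal sketch, not an official NLTK recipe: `ensure_resource` is a hypothetical helper, and its `find` and `download` parameters are an assumption added so the logic can be exercised without network access. By default they fall back to nltk.data.find and nltk.download.

```python
def ensure_resource(resource_path, package_id, find=None, download=None):
    """Return True if the resource is available, downloading it first if needed.

    resource_path: location inside nltk_data, e.g. 'tokenizers/punkt'
    package_id:    id to pass to the downloader, e.g. 'punkt'
    """
    if find is None or download is None:
        # Imported lazily so the helper can be tried with stand-in callables.
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the resource is missing
        return True
    except LookupError:
        # The downloader reports success or failure as a truthy/falsy value.
        return bool(download(package_id))
```

Calling `ensure_resource('tokenizers/punkt', 'punkt')` before `word_tokenize` would avoid the traceback above. The resource path and package id depend on the NLTK version, so take them from the error message your installation prints.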


