
How do I download NLTK data?

Tags: python, nltk

Update: NLTK works well with Python 2.7. I had Python 3.2; after uninstalling it and installing 2.7, it works.

I have installed NLTK and tried to download the NLTK data by following the instructions on this site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

It gave me the following error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'

Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I ran help(nltk) to inspect the package, and it shows the following info:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    c:\python32\lib\site-packages\nltk

I do see downloader in that list, so I am not sure why it does not work. I am using Python 3.2.2 on Windows Vista.

Q-ximi asked Mar 05 '14 23:03



1 Answer

TL;DR

To download a particular dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer:

$ python3
>>> import nltk
>>> nltk.download('punkt')
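If a script calls nltk.download() every time it starts, it may be worth checking first whether the resource is already present. Below is a minimal sketch of that idea, assuming nltk.data.find() raises LookupError for a missing resource. The helper name ensure_resource and its find/download parameters are hypothetical; they exist only so the logic can be exercised without NLTK or network access.

```python
def ensure_resource(resource_path, resource_id, find=None, download=None):
    """Download resource_id only if resource_path cannot be found locally.

    Returns True if a download was triggered, False if the resource was
    already present. `find` and `download` default to nltk.data.find and
    nltk.download (hypothetical injection points, for testing only).
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)      # raises LookupError when the resource is missing
        return False
    except LookupError:
        download(resource_id)    # e.g. 'punkt'
        return True
```

With NLTK installed, a call would look like ensure_resource('tokenizers/punkt', 'punkt').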

If you're unsure which data or models you need, you can start with the basic list of data and models:

>>> import nltk
>>> nltk.download('popular')

This downloads a list of "popular" resources, which includes:

<collection id="popular" name="Popular packages">
  <item ref="cmudict" />
  <item ref="gazetteers" />
  <item ref="genesis" />
  <item ref="gutenberg" />
  <item ref="inaugural" />
  <item ref="movie_reviews" />
  <item ref="names" />
  <item ref="shakespeare" />
  <item ref="stopwords" />
  <item ref="treebank" />
  <item ref="twitter_samples" />
  <item ref="omw" />
  <item ref="wordnet" />
  <item ref="wordnet_ic" />
  <item ref="words" />
  <item ref="maxent_ne_chunker" />
  <item ref="punkt" />
  <item ref="snowball_data" />
  <item ref="averaged_perceptron_tagger" />
</collection>
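Since the collection is an XML fragment, the resource ids can be read out with the standard library. This is a small sketch only: POPULAR_XML is an abbreviated copy of the listing above, and collection_items is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the "popular" collection, for illustration only.
POPULAR_XML = """
<collection id="popular" name="Popular packages">
  <item ref="cmudict" />
  <item ref="stopwords" />
  <item ref="wordnet" />
  <item ref="punkt" />
</collection>
"""

def collection_items(xml_text):
    """Return (collection id, list of item refs) for a collection fragment."""
    root = ET.fromstring(xml_text.strip())
    refs = [item.get("ref") for item in root.findall("item")]
    return root.get("id"), refs

collection_id, refs = collection_items(POPULAR_XML)
print(collection_id, refs)
```

Each ref is an id that can be passed to nltk.download() on its own, which may help when only part of the collection is needed.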

EDITED

If you run into errors when downloading larger datasets from NLTK, here is a workaround from https://stackoverflow.com/a/38135306/610569:

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index to treat panlex_lite as it's already installed.
>>> dler.download('popular')

Updated

From v3.2.5, NLTK has a more informative error message when an nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
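The message can also be read programmatically, for example to log which resource a script is missing. This is a rough sketch, assuming the message keeps the "Resource <id> not found" wording shown above. Stripping ANSI color codes is a defensive assumption in case a version colors the resource name, and missing_resource_id is a hypothetical helper name.

```python
import re

# Terminal color codes, removed defensively before matching (assumption).
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")
# Matches the "Resource punkt not found." line from the message above.
RESOURCE_LINE = re.compile(r"Resource\s+(\S+)\s+not found")

def missing_resource_id(error):
    """Return the resource id named in a LookupError message, or None."""
    text = ANSI_ESCAPE.sub("", str(error))
    match = RESOURCE_LINE.search(text)
    return match.group(1) if match else None
```

In practice this would sit inside an except LookupError block around the call that needs the resource. The message wording is not a stable interface, so treat the result as a hint, not a guarantee.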

Related

  • To find the nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569

  • To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569

  • To configure the nltk_data path (i.e. set a different path for NLTK to find nltk_data), see https://stackoverflow.com/a/22987374/610569
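As a quick illustration of the last two points, here is a minimal sketch that downloads into a custom directory and registers that directory so NLTK can find it. It assumes nltk.data.path is the list of search directories and that nltk.download() accepts a download_dir argument. The helper names add_data_dir and download_to are hypothetical.

```python
import os

def add_data_dir(data_dir, search_path):
    """Create data_dir if needed and append it to search_path once."""
    data_dir = os.path.abspath(os.path.expanduser(data_dir))
    os.makedirs(data_dir, exist_ok=True)
    if data_dir not in search_path:
        search_path.append(data_dir)
    return data_dir

def download_to(data_dir, resource_id):
    """Download resource_id into data_dir and make NLTK search there."""
    import nltk  # imported lazily; requires NLTK and network access
    target = add_data_dir(data_dir, nltk.data.path)
    return nltk.download(resource_id, download_dir=target)
```

For example, download_to('~/my_nltk_data', 'punkt') would place the tokenizer under that directory. Setting the NLTK_DATA environment variable before starting Python is another way to point NLTK at a custom location.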

alvas answered Oct 27 '22 18:10