How do I download NLTK data?

Tags: python, nltk

Update: NLTK works well with Python 2.7. I had Python 3.2; I uninstalled 3.2 and installed 2.7, and now it works.

I have installed NLTK and tried to download the NLTK data by following the instructions on this site: http://www.nltk.org/data.html

I downloaded NLTK, installed it, and then tried to run the following code:

>>> import nltk
>>> nltk.download()

It gave me the following error message:

Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    nltk.download()
AttributeError: 'module' object has no attribute 'download'

Directory of C:\Python32\Lib\site-packages

I tried both nltk.download() and nltk.downloader(); both gave me error messages.

Then I used help(nltk) to inspect the package, which shows the following info:

NAME
    nltk

PACKAGE CONTENTS
    align
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collocations
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    examples (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    lazyimport
    metrics (package)
    misc (package)
    model (package)
    parse (package)
    probability
    sem (package)
    sourcedstring
    stem (package)
    tag (package)
    test (package)
    text
    tokenize (package)
    toolbox
    tree
    treetransforms
    util
    yamltags

FILE
    c:\python32\lib\site-packages\nltk

I do see downloader there, so I am not sure why it does not work. I am using Python 3.2.2 on Windows Vista.

801

asked Mar 05 '14 23:03

Q-ximi

1 Answer

TL;DR

To download a particular dataset or model, use the nltk.download() function. For example, to download the punkt sentence tokenizer, use:

$ python3
>>> import nltk
>>> nltk.download('punkt')

If you're unsure which data or models you need, you can start with the basic list of data and models:

>>> import nltk
>>> nltk.download('popular')

This downloads a list of "popular" resources, which includes:

<collection id="popular" name="Popular packages">
  <item ref="cmudict" />
  <item ref="gazetteers" />
  <item ref="genesis" />
  <item ref="gutenberg" />
  <item ref="inaugural" />
  <item ref="movie_reviews" />
  <item ref="names" />
  <item ref="shakespeare" />
  <item ref="stopwords" />
  <item ref="treebank" />
  <item ref="twitter_samples" />
  <item ref="omw" />
  <item ref="wordnet" />
  <item ref="wordnet_ic" />
  <item ref="words" />
  <item ref="maxent_ne_chunker" />
  <item ref="punkt" />
  <item ref="snowball_data" />
  <item ref="averaged_perceptron_tagger" />
</collection>
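To turn a listing like this into plain package ids, a minimal sketch using only the standard library might look as follows. The XML here is a four-item excerpt of the collection, and `collection_refs` is a hypothetical helper name, not part of NLTK.

```python
import xml.etree.ElementTree as ET

# Excerpt of the "popular" collection (four of its items), for illustration.
POPULAR_XML = """
<collection id="popular" name="Popular packages">
  <item ref="stopwords" />
  <item ref="wordnet" />
  <item ref="punkt" />
  <item ref="averaged_perceptron_tagger" />
</collection>
"""


def collection_refs(xml_text):
    """Return the package ids referenced by a <collection> element."""
    root = ET.fromstring(xml_text.strip())
    return [item.get("ref") for item in root.findall("item")]


print(collection_refs(POPULAR_XML))
```

Each id returned this way is the kind of name you would pass to nltk.download(), e.g. nltk.download('punkt').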

EDITED

To avoid errors when downloading larger datasets from NLTK, remove the panlex_lite files and mark the package as installed before downloading (from https://stackoverflow.com/a/38135306/610569):

$ rm /Users/<your_username>/nltk_data/corpora/panlex_lite.zip
$ rm -r /Users/<your_username>/nltk_data/corpora/panlex_lite
$ python

>>> import nltk
>>> dler = nltk.downloader.Downloader()
>>> dler._update_index()
>>> dler._status_cache['panlex_lite'] = 'installed' # Trick the index into treating panlex_lite as if it's already installed.
>>> dler.download('popular')

Updated

Since v3.2.5, NLTK gives a more informative error message when an nltk_data resource is not found, e.g.:

>>> from nltk import word_tokenize
>>> word_tokenize('x')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/l/alvas/git/nltk/nltk/tokenize/__init__.py", line 128, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "/Users//alvas/git/nltk/nltk/tokenize/__init__.py", line 94, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/Users/alvas/git/nltk/nltk/data.py", line 820, in load
    opened_resource = _open(resource_url)
  File "/Users/alvas/git/nltk/nltk/data.py", line 938, in _open
    return find(path_, path + ['']).open()
  File "/Users/alvas/git/nltk/nltk/data.py", line 659, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Searched in:
    - '/Users/alvas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
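Since a missing resource surfaces as a LookupError, one way to avoid the traceback is to check for the resource first and download it only when the lookup fails. The sketch below assumes nltk.data.find and nltk.download behave as in the traceback above; `ensure_nltk_resource` is a hypothetical helper, and its `find`/`download` parameters exist only so the logic can be tried without NLTK or network access.

```python
def ensure_nltk_resource(resource_path, package_id, find=None, download=None):
    """Download `package_id` only if `resource_path` cannot be found.

    Returns True if a download was triggered, False if the resource
    was already available.
    """
    if find is None or download is None:
        import nltk  # imported lazily; only needed when no stubs are passed
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the resource is missing
        return False
    except LookupError:
        download(package_id)
        return True


# Usage with NLTK (assumes NLTK is installed and the network is reachable):
#   ensure_nltk_resource("tokenizers/punkt", "punkt")
```

Checking first keeps repeated runs quiet, since nltk.download() is only called when the lookup actually fails.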

To find the nltk_data directory (auto-magically), see https://stackoverflow.com/a/36383314/610569
To download nltk_data to a different path, see https://stackoverflow.com/a/48634212/610569
To configure the nltk_data path (i.e. set a different path for NLTK to find nltk_data), see https://stackoverflow.com/a/22987374/610569
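As a rough illustration of the last two points, the sketch below registers a custom directory in a search-path list and returns its absolute path. `add_data_dir` and the directory name are hypothetical; the commented usage assumes that nltk.data.path is the list NLTK searches and that nltk.download() accepts a download_dir argument.

```python
import os


def add_data_dir(search_path, directory):
    """Put `directory` at the front of `search_path` and return its absolute path."""
    directory = os.path.abspath(os.path.expanduser(directory))
    if directory not in search_path:
        search_path.insert(0, directory)
    return directory


# Usage with NLTK (assumes NLTK is installed and the network is reachable):
#   import nltk
#   target = add_data_dir(nltk.data.path, "~/my_nltk_data")  # hypothetical directory
#   nltk.download("punkt", download_dir=target)
```

Putting the directory first in the list means NLTK would look there before the default locations shown in the "Searched in" output.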

127

answered Oct 27 '22 18:10

alvas

