
Change nltk.download() path directory from default ~/nltk_data


I was trying to download/update Python NLTK packages on a computing server, and the download failed with [Errno 122] Disk quota exceeded.

Specifically:

[nltk_data] Downloading package stop words to /home/sh2264/nltk_data...
[nltk_data] Error downloading u'stopwords' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/corpora/stopwords.zip>: [Errno 122]
[nltk_data] Disk quota exceeded:
[nltk_data] u'/home/sh2264/nltk_data/corpora/stopwords.zip
False

How can I change the download path for all nltk packages, and what other changes should I make so that nltk loads them without errors?

shenglih asked Jul 01 '17 04:07


2 Answers

The download directory can be set in code (nltk.download(..., download_dir=...)), on the command line (the -d flag), or in the GUI. Bizarrely, nltk seems to ignore its own NLTK_DATA environment variable and default its download directories to a standard set of five paths, regardless of whether NLTK_DATA is defined or where it points, and regardless of whether those five default directories even exist on the machine or architecture(!). Some of this is documented in Installing NLTK Data, although that page is incomplete and the details are rather buried; the relevant part is reproduced below with clearer formatting:

Command line installation

The downloader will search for an existing nltk_data directory to install NLTK data. If one does not exist it will attempt to create one in a central location (when using an administrator account) or otherwise in the user’s filespace. If necessary, run the download command from an administrator account, or using sudo. The recommended system location is:

  • C:\nltk_data (Windows);
  • /usr/local/share/nltk_data (Mac); and
  • /usr/share/nltk_data (Unix).

You can use the -d flag to specify a different location (but if you do this, be sure to set the NLTK_DATA environment variable accordingly).

  • Run the command python -m nltk.downloader all

  • To ensure central installation, run the command: sudo python -m nltk.downloader -d /usr/local/share/nltk_data all

  • But really they should say: sudo python -m nltk.downloader -d $NLTK_DATA all

As for which path NLTK_DATA should point to, nltk gives no real guidance, but it should be a generic standalone path that is neither under an install tree (so not under <python-install-directory>/lib/site-packages) nor under any user directory. Hence /usr/local/share, /opt/share or similar. On macOS 10.7+, /usr, and thus /usr/local, is hidden by default, so /opt/share may well be a better choice. Alternatively, unhide it with chflags nohidden /usr/local/share.
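The "-d $NLTK_DATA" suggestion above can also be sketched in Python, for cases where the download is triggered from a script rather than the shell. This is only an illustration: resolve_download_dir and download_to_env_dir are hypothetical helper names, and the fallback path is simply the central location quoted above. It assumes NLTK_DATA may hold several directories separated by the platform's path separator and takes the first non-empty one.

```python
import os


def resolve_download_dir(fallback="/usr/local/share/nltk_data"):
    """Return the first non-empty entry of NLTK_DATA, or `fallback` if unset."""
    raw = os.environ.get("NLTK_DATA", "")
    # NLTK_DATA may list several directories separated by os.pathsep.
    for entry in raw.split(os.pathsep):
        if entry:
            return entry
    return fallback


def download_to_env_dir(package="all"):
    """Download `package` into the directory chosen by resolve_download_dir()."""
    import nltk  # imported lazily so the helper above works without NLTK installed

    return nltk.download(package, download_dir=resolve_download_dir())
```

Calling download_to_env_dir("stopwords") would then write into whatever NLTK_DATA points at, which is useful when the home directory is the one with the exhausted quota.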

smci answered Oct 05 '22 23:10


According to the documentation:

By default, packages are installed in either a system-wide directory (if Python has sufficient access to write to it); or in the current user’s home directory. However, the download_dir argument may be used to specify a different installation target, if desired.

To specify the download directory, use for example:

nltk.download('treebank', download_dir='/mnt/data/treebank') 
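Downloading to a custom directory is only half of the question; nltk also has to look there when loading. A minimal sketch, assuming the data should live under a hypothetical /mnt/data/nltk_data: it exports NLTK_DATA (read when nltk is first imported) and also adds the directory to nltk.data.path, which covers the case where nltk was already imported. The helper names prepare_nltk_data_dir and download_and_register are made up for illustration.

```python
import os
from pathlib import Path


def prepare_nltk_data_dir(target):
    """Create `target` if needed and export it as NLTK_DATA."""
    path = Path(target).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["NLTK_DATA"] = str(path)
    return path


def download_and_register(package, target="/mnt/data/nltk_data"):
    """Download `package` into `target` and make nltk search it first."""
    data_dir = str(prepare_nltk_data_dir(target))
    import nltk  # imported after NLTK_DATA has been set

    # If nltk was imported earlier, the environment variable was read too
    # early, so register the directory on the search path explicitly.
    if data_dir not in nltk.data.path:
        nltk.data.path.insert(0, data_dir)
    return nltk.download(package, download_dir=data_dir)
```

After download_and_register('stopwords'), loading the corpus in the same process should find it under the new directory; in other processes, NLTK_DATA needs to be set in the environment before nltk is imported.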
Ortomala Lokni answered Oct 06 '22 01:10