
Resource 'corpora/wordnet' not found on Heroku

I'm trying to get NLTK and wordnet working on Heroku. I've already done the following:

heroku run python
nltk.download()
  wordnet
pip install -r requirements.txt

But I get this error:

Resource 'corpora/wordnet' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/app/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'

Yet I've looked in /app/nltk_data and it's there, so I'm not sure what's going on.

user1881006 asked Dec 20 '12 05:12


5 Answers

I just had this same problem. What ended up working for me is creating an 'nltk_data' directory in the application's folder itself, downloading the corpus to that directory and adding a line to my code that lets the nltk know to look in that directory. You can do this all locally and then push the changes to Heroku.

So, supposing my python application is in a directory called "myapp/"

Step 1: Create the directory

cd myapp/
mkdir nltk_data

Step 2: Download Corpus to New Directory

python -m nltk.downloader

This'll pop up the nltk downloader. Set your Download Directory to whatever_the_absolute_path_to_myapp_is/nltk_data/. If you're using the GUI downloader, the download directory is set through a text field on the bottom of the UI. If you're using the command line one, you set it in the config menu.

Once the downloader knows to point to your newly created nltk_data directory, download your corpus.

Or in one step from Python code:

import nltk
nltk.download("wordnet", download_dir="whatever_the_absolute_path_to_myapp_is/nltk_data/")

Step 3: Let nltk Know Where to Look

nltk looks for data, resources, etc. in the locations specified in the nltk.data.path variable. All you need to do is add nltk.data.path.append('./nltk_data/') to the Python file actually using nltk, and it will look for corpora, tokenizers, and such in there in addition to the default paths.
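Here is a minimal sketch of Step 3 that builds an absolute path instead of './nltk_data/', so the lookup does not depend on where the process was started. The helper names bundled_nltk_data_dir and register_bundled_nltk_data are made up for illustration, and the sketch assumes the nltk_data folder sits next to the module that calls it.

```python
from pathlib import Path


def bundled_nltk_data_dir(module_file: str) -> str:
    """Return the absolute path of an nltk_data folder next to module_file."""
    return str(Path(module_file).resolve().parent / "nltk_data")


def register_bundled_nltk_data(module_file: str) -> None:
    """Add the bundled folder to NLTK's search path if it is not there yet."""
    import nltk  # imported here so the path helper works without NLTK

    data_dir = bundled_nltk_data_dir(module_file)
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)


# In the module that uses NLTK (assumed to sit beside nltk_data/):
# register_bundled_nltk_data(__file__)
```

Call register_bundled_nltk_data(__file__) before the first use of a corpus, since nltk only consults nltk.data.path when it loads the resource.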

Step 4: Send it to Heroku

git add nltk_data/
git commit -m 'super useful commit message'
git push heroku master

That should work! It did for me anyway. One thing worth noting is that a relative path such as './nltk_data/' is resolved against the process's current working directory, not against the location of the Python file, so the right value may differ depending on how you've structured and launched your application. Just account for that when you do nltk.data.path.append('path_to_nltk_data')

follyroof answered Nov 20 '22 18:11


Update

As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Add a nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.


Original Answer

Here's a cleaner solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.

I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK only installation.

The default heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:

# post_compile
#!/usr/bin/env bash

if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi

As you can see, it looks in your project directory for your own post_compile file in the bin directory, and it runs it if it exists. You can use this hook to install the nltk data.

  1. Create the bin directory in the root of your local project.

  2. Add your own post_compile file to the bin directory.

    #!/usr/bin/env bash
    # bin/post_compile
    
    if [ -f bin/install_nltk_data ]; then
        echo "-----> Running install_nltk_data"
        chmod +x bin/install_nltk_data
        bin/install_nltk_data
    fi
    
    echo "-----> Post-compile done"
    
  3. Add your own install_nltk_data file to the bin directory.

    #!/usr/bin/env bash
    # bin/install_nltk_data
    
    source $BIN_DIR/utils
    
    echo "-----> Starting nltk data installation"
    
    # Assumes NLTK_DATA environment variable is already set
    # $ heroku config:set NLTK_DATA='/app/nltk_data'
    
    # Install the nltk data
    # NOTE: The following command installs the wordnet corpus,
    # so you may want to change it for your specific needs.
    # See http://www.nltk.org/data.html
    python -m nltk.downloader wordnet
    
    # If using Textblob, use this instead:
    # python -m textblob.download_corpora lite
    
    # Open the NLTK_DATA directory
    cd "${NLTK_DATA}"
    
    # Delete all of the zip files
    find . -name "*.zip" -type f -delete
    
    echo "-----> Finished nltk data installation"
    
  4. Add nltk to your requirements.txt file (Or textblob if you are using Textblob).

  5. Commit all of these changes to your repo.

  6. Set the NLTK_DATA environment variable on your heroku app.

    $ heroku config:set NLTK_DATA='/app/nltk_data'
    
  7. Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
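For anyone who would rather keep the download logic in Python, the core of the install_nltk_data script from step 3 could be sketched roughly as below. This is only an illustration under the same assumption as the shell version (NLTK_DATA is already set); the function names delete_zip_files and install_nltk_data are hypothetical and not part of the buildpack or of nltk.

```python
import os
from pathlib import Path


def delete_zip_files(root: str) -> int:
    """Delete every *.zip file under root and return how many were removed."""
    removed = 0
    # Collect the matches first so files are not deleted mid-traversal.
    for zip_path in list(Path(root).rglob("*.zip")):
        if zip_path.is_file():
            zip_path.unlink()
            removed += 1
    return removed


def install_nltk_data(package: str = "wordnet") -> None:
    """Download one NLTK package into $NLTK_DATA, then drop the archives."""
    import nltk  # local import: only needed when actually downloading

    target = os.environ["NLTK_DATA"]  # assumed set, e.g. /app/nltk_data
    nltk.download(package, download_dir=target)
    delete_zip_files(target)
```

As in the shell script, removing the archives is only safe for packages that the downloader unpacks, so it is worth checking that the unpacked folder exists before relying on it.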

I hope you found this helpful! Enjoy!

Michael Godshall answered Nov 20 '22 18:11


For Mac OS users only.

python -m nltk.downloader -d /usr/share/nltk_data wordnet

The corpus data can't be downloaded directly to the /usr/share/nltk_data folder; the downloader reports a "no permission" error. There are two solutions:

  1. Change the permissions on the Mac system; for details, refer to "Operation Not Permitted when on root El capitan (rootless disabled)". However, I didn't want to change a Mac default setting just for this corpus, so I went with the second solution.

  2. Download to a directory you can write to, then point nltk at it:

    • Download the corpus to any directory you have access to: `python -m nltk.downloader -d some_user_accessible_directory wordnet`. Note that this downloads only the corpora you need (e.g., wordnet, reuters) instead of the whole nltk collection.
    • Add that directory to the nltk path. In your .py file, add the following lines, replacing 'nltk_data' with the directory you downloaded to:

      import nltk
      nltk.data.path.append('nltk_data')

HappyCoding answered Nov 20 '22 18:11


I was getting this issue too. Those who are not working in a virtual environment will need to download the corpus to the following directory on Ubuntu:

/usr/share/nltk_data/corpora/wordnet

Instead of wordnet it could be brown or whatever. You can directly run this command in your terminal if you want to download the corpus.

$ sudo python -m nltk.downloader -d /usr/share/nltk_data wordnet

Again instead of wordnet it could be brown.
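To see which of the searched directories actually hold the corpus, something like the following sketch may help. The function name dirs_containing is made up for illustration, and the check (an unpacked folder or a .zip with the same name) is an assumption about how the data is laid out on disk, not a reimplementation of nltk's own lookup.

```python
import os


def dirs_containing(resource, search_paths):
    """Return the search paths that hold resource as a folder or a .zip."""
    hits = []
    for base in search_paths:
        candidate = os.path.join(base, *resource.split("/"))
        if os.path.isdir(candidate) or os.path.isfile(candidate + ".zip"):
            hits.append(base)
    return hits


# Example with two directories from the error message; inside an app,
# passing nltk.data.path instead would cover the full search list.
print(dirs_containing("corpora/wordnet",
                      ["/usr/share/nltk_data", "/usr/local/share/nltk_data"]))
```

An empty list means none of the given directories contains the corpus, which usually points to a download that went to a different location.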

Gaurav Anand answered Nov 20 '22 16:11


This one works:

For Mac OS users.

python -m nltk.downloader -d /usr/local/share/nltk_data wordnet

Thiago answered Nov 20 '22 18:11