LookupError: Resource 'corpora/stopwords' not found

Question

I am trying to run a webapp on Heroku using Flask. The webapp is programmed in Python with the NLTK (Natural Language Toolkit library).

One of the file has the following header:

import nltk, json, operator
from nltk.corpus import stopwords 
from nltk.tokenize import RegexpTokenizer

When the webpage with the stopwords code is called, it produces the following error:

LookupError: 
**********************************************************************
  Resource 'corpora/stopwords' not found.  Please use the NLTK  
  Downloader to obtain the resource:  >>> nltk.download()  
  Searched in:  
    - '/app/nltk_data'  
    - '/usr/share/nltk_data'  
    - '/usr/local/share/nltk_data'  
    - '/usr/lib/nltk_data'  
    - '/usr/local/lib/nltk_data'  
**********************************************************************

The exact code used:

#remove punctuation  
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True) 
data = toker.tokenize(data)  

#remove stop words and digits 
stopword = stopwords.words('english')  
data = [w for w in data if w not in stopword and not w.isdigit()]

The webapp on Heroku doesn't produce the Lookup error when stopword = stopwords.words('english') is commented out.

The code runs without a glitch on my local computer. I have have installed the required libraries on my computer using

pip install requirements.txt

The virtual environment provided by Heroku was running when I tested the code on my computer.

I have also tried the NLTK provided by two different sources, but the LookupError is still there. The two sources I used are:
http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc4.zip
https://github.com/nltk/nltk.git

Gaurang · Accepted Answer

The problem is that the corpus ('stopwords' in this case) doesn't get uploaded to Heroku. Your code works on your local machine because it already has the NLTK corpus. Please follow these steps to solve the issue.

Create a new directory in your project (let's call it 'nltk_data')
Download the NLTK corpus in that directory. You will have to configure that during the download.
Tell nltk to look for this particular path. Just add nltk.data.path.append('path_to_nltk_data') to the Python file that's actually using nltk.
Now push the app to Heroku.

Hope that solves the problem. Worked for me!

Michael Godshall · Answer

Update

As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Add a nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.

Original Answer

Here's a cleaner solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.

I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK only installation.

The default heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:

# post_compile
#!/usr/bin/env bash

if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi

As you can see, it looks in your project directory for your own post_compile file in the bin directory, and it runs it if it exists. You can use this hook to install the nltk data.

Create the bin directory in the root of your local project.

Add your own post_compile file to the bin directory.

# bin/post_compile
#!/usr/bin/env bash

if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi

echo "-----> Post-compile done"

Add your own install_nltk_data file to the bin directory.

# bin/install_nltk_data
#!/usr/bin/env bash

source $BIN_DIR/utils

echo "-----> Starting nltk data installation"

# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'

# Install the nltk data
# NOTE: The following command installs the stopwords corpora, 
# so you may want to change for your specific needs.  
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords

# If using Textblob, use this instead:
# python -m textblob.download_corpora lite

# Open the NLTK_DATA directory
cd ${NLTK_DATA}

# Delete all of the zip files
find . -name "*.zip" -type f -delete

echo "-----> Finished nltk data installation"

Add nltk to your requirements.txt file (Or textblob if you are using Textblob).
Commit all of these changes to your repo.
Set the NLTK_DATA environment variable on your heroku app.
```
$ heroku config:set NLTK_DATA='/app/nltk_data'
```
Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.

I hope you found this helpful! Enjoy!

LookupError: Resource 'corpora/stopwords' not found

Tags:

python

heroku

flask

python-2.7

nltk

user3534472

2 Answers

Gaurang

Update

Original Answer

Michael Godshall

Recent Activity

Donate For Us

LookupError: Resource 'corpora/stopwords' not found

Tags:

python

heroku

flask

python-2.7

nltk

user3534472

2 Answers

Gaurang

Update

Original Answer

Michael Godshall

Related questions

Recent Activity

Donate For Us