I am trying to run a webapp on Heroku using Flask. The webapp is written in Python and uses NLTK (the Natural Language Toolkit library).
One of the files has the following header:
import nltk, json, operator
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
When the webpage with the stopwords code is called, it produces the following error:
LookupError:
**********************************************************************
Resource 'corpora/stopwords' not found. Please use the NLTK
Downloader to obtain the resource: >>> nltk.download()
Searched in:
- '/app/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
**********************************************************************
The exact code used:
#remove punctuation
toker = RegexpTokenizer(r'((?<=[^\w\s])\w(?=[^\w\s])|(\W))+', gaps=True)
data = toker.tokenize(data)
#remove stop words and digits
stopword = stopwords.words('english')
data = [w for w in data if w not in stopword and not w.isdigit()]
The webapp on Heroku doesn't produce the LookupError when the line stopword = stopwords.words('english') is commented out.
The code runs without a glitch on my local computer. I have installed the required libraries on my computer using
pip install -r requirements.txt
The virtual environment provided by Heroku was running when I tested the code on my computer.
I have also tried the NLTK provided by two different sources, but the LookupError is still there. The two sources I used are:
http://pypi.python.org/packages/source/n/nltk/nltk-2.0.1rc4.zip
https://github.com/nltk/nltk.git
The problem is that the corpus ('stopwords' in this case) doesn't get uploaded to Heroku. Your code works on your local machine because the NLTK corpus is already installed there. To solve the issue, make the corpus available in the deployed app and add
nltk.data.path.append('path_to_nltk_data')
to the Python file that actually uses nltk. Hope that solves the problem. Worked for me!
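A minimal sketch of that last step, assuming the corpus was copied into a directory named nltk_data in the project root and that the app is started from that root. The directory name and the helper register_nltk_data are hypothetical. Since nltk.data.path is a plain Python list of directories, the helper takes the list as an argument, so it can be tried without NLTK installed:

```python
import os


def register_nltk_data(search_path, data_dir):
    """Append data_dir to an NLTK-style search path list, at most once."""
    data_dir = os.path.abspath(data_dir)
    if data_dir not in search_path:
        search_path.append(data_dir)
    return search_path


# Hypothetical layout: <project root>/nltk_data/corpora/stopwords
# Assumes the process is started from the project root.
PROJECT_ROOT = os.getcwd()
NLTK_DATA_DIR = os.path.join(PROJECT_ROOT, "nltk_data")

# In the real app, pass nltk.data.path instead of this stand-in list:
#   import nltk
#   register_nltk_data(nltk.data.path, NLTK_DATA_DIR)
example_path = ["/usr/share/nltk_data"]
register_nltk_data(example_path, NLTK_DATA_DIR)
```

The call has to run before the first stopwords.words('english') lookup, so the top of the module that imports nltk is a reasonable place for it.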
As Kenneth Reitz pointed out, a much simpler solution has been added to the heroku-python-buildpack. Add an nltk.txt file to your root directory and list your corpora inside. See https://devcenter.heroku.com/articles/python-nltk for details.
Here's a cleaner solution that allows you to install the NLTK data directly on Heroku without adding it to your git repo.
I used similar steps to install Textblob on Heroku, which uses NLTK as a dependency. I've made some minor adjustments to my original code in steps 3 and 4 that should work for an NLTK-only installation.
The default Heroku buildpack includes a post_compile step that runs after all of the default build steps have been completed:
# post_compile
#!/usr/bin/env bash
if [ -f bin/post_compile ]; then
    echo "-----> Running post-compile hook"
    chmod +x bin/post_compile
    sub-env bin/post_compile
fi
As you can see, it looks in your project directory for your own post_compile file in the bin directory, and it runs it if it exists. You can use this hook to install the nltk data.
1. Create the bin directory in the root of your local project.
2. Add your own post_compile file to the bin directory.
# bin/post_compile
#!/usr/bin/env bash
if [ -f bin/install_nltk_data ]; then
    echo "-----> Running install_nltk_data"
    chmod +x bin/install_nltk_data
    bin/install_nltk_data
fi
echo "-----> Post-compile done"
3. Add your own install_nltk_data file to the bin directory.
# bin/install_nltk_data
#!/usr/bin/env bash
source $BIN_DIR/utils
echo "-----> Starting nltk data installation"
# Assumes NLTK_DATA environment variable is already set
# $ heroku config:set NLTK_DATA='/app/nltk_data'
# Install the nltk data
# NOTE: The following command installs only the stopwords corpus,
# so you may want to change it for your specific needs.
# See http://www.nltk.org/data.html
python -m nltk.downloader stopwords
# If using Textblob, use this instead:
# python -m textblob.download_corpora lite
# Open the NLTK_DATA directory
cd "${NLTK_DATA}"
# Delete all of the zip files
find . -name "*.zip" -type f -delete
echo "-----> Finished nltk data installation"
4. Add nltk to your requirements.txt file (or textblob if you are using Textblob).
5. Commit all of these changes to your repo.
6. Set the NLTK_DATA environment variable on your Heroku app.
$ heroku config:set NLTK_DATA='/app/nltk_data'
7. Deploy to Heroku. You will see the post_compile step trigger at the end of the deployment, followed by the nltk download.
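To confirm after a deploy that the download actually landed where NLTK will look, a small startup check along these lines may help. It is only a sketch: the helper names corpus_installed and missing_corpora are hypothetical, and it assumes the unpacked corpus sits at corpora/stopwords under the NLTK_DATA directory (the install script above deletes the zip files, so only the unpacked directory is checked).

```python
import os


def corpus_installed(data_dir, resource="corpora/stopwords"):
    """Return True if the unpacked resource directory exists under data_dir."""
    return os.path.isdir(os.path.join(data_dir, *resource.split("/")))


def missing_corpora(data_dir, resources):
    """Return the subset of resources that is not present under data_dir."""
    return [r for r in resources if not corpus_installed(data_dir, r)]


# NLTK_DATA is the variable set with `heroku config:set`;
# the fallback is the path used in the steps above.
data_dir = os.environ.get("NLTK_DATA", "/app/nltk_data")
missing = missing_corpora(data_dir, ["corpora/stopwords"])
if missing:
    print("Missing NLTK resources in %s: %s" % (data_dir, ", ".join(missing)))
else:
    print("All NLTK resources found in %s" % data_dir)
```

Running this once at app startup (or from a one-off dyno) gives a clearer message than the LookupError raised on the first request.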
I hope you found this helpful! Enjoy!