Using downloaded NLTK data on AWS Elastic Beanstalk

I have a Django app running on AWS Elastic Beanstalk. I use an NLTK corpus package (stopwords), which I obtain using the NLTK downloader.

As a quick hack, I just ran the NLTK downloader on my current (single) Elastic Beanstalk EC2 instance and saved the needed corpus to /usr/local/share/nltk_data. This works on a single instance (and survives deployments), but the data will obviously be missing from any new instances the load balancer decides to spin up.
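For reference, the hack was roughly the equivalent of the following (path as above; /usr/local/share/nltk_data is on NLTK's default search path, so nothing else is needed at runtime):

 import nltk

 # download the stopwords corpus into the shared location
 nltk.download('stopwords', download_dir='/usr/local/share/nltk_data')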

My question is: what is the best approach here, specifically for this data?

Should I store it on S3 and tie that to my Elastic Beanstalk environment?

Or is it easier (and better) to write a (Python?) script, called by the EB configuration for each new instance, that downloads the data and places it in a folder accessible by the app for the life of the instance? That way, if I need to add other corpus downloads or do Python-specific or NLTK-specific things, it all happens in Python and doesn't require manual S3 work.

If anyone recommends the script-for-EB-configuration approach, an example would be great; I am not sure exactly how to do this.

Thanks!

— kilgoretrout, asked Nov 11 '16

2 Answers

It is very easy to use S3 for this specific use case, in combination with IAM and EC2 instance roles.

Even with fast-changing data (NLTK corpora are slow-changing, I assume), you can just manually sync the differences to an existing S3 location so that your instances have new data available when they need it.

The key is to give your instances IAM roles, using Instance Profiles. With a proper policy, they can securely access S3 without you having to define AWS credentials manually, stash them in a script that calls the AWS CLI at instance start, etc.

There are significant security benefits to using Instance Profiles for IAM permissions to AWS resources: they eliminate hard-coded credentials in scripts, in your Git repository, and so on.
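As a minimal sketch of what that buys you (the bucket name is just an example), code running on an instance with a profile attached can talk to S3 without any keys appearing anywhere:

 import boto3

 # No access key or secret is passed: boto3 automatically falls back to
 # the temporary credentials provided by the instance profile.
 s3 = boto3.client('s3')
 for obj in s3.list_objects_v2(Bucket='mybucket').get('Contents', []):
     print(obj['Key'])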

Then, assuming the AWS CLI is installed on Linux (via apt, pip, etc.):

 # Create the bucket (once).
 # Put it in a region/AZ where your EC2 instances run
 # to minimize data transfer.

 # These can be run from anywhere to get your bucket/data up:
 aws s3 mb s3://mybucket --region us-west-1

 # Sync up the first time, and whenever the data changes:
 aws s3 sync /usr/local/share/nltk_data s3://mybucket/nltk_data


 # Run the below on your instances --
 # in an instance startup script (after the AWS CLI is installed),
 # in a myscript.sh file on your instance (even a gist),
 # or wherever you want an instance to get or refresh your data:

 aws s3 sync s3://mybucket/nltk_data /path/where/i/need

The nice thing about the sync command is that it skips files that have not been modified, both when pushing up to S3 and when pulling down. This makes it super handy for things like common datasets, backups, etc.
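And to tie this to Elastic Beanstalk specifically: one way (a sketch, untested here; the bucket and target path are examples) is to run the pull-down sync as a container command in an .ebextensions config file, so every new instance gets the data on deployment:

 # .ebextensions/nltk_data.config
 container_commands:
   01_sync_nltk_data:
     command: "aws s3 sync s3://mybucket/nltk_data /usr/local/share/nltk_data"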

— cerd, answered Oct 08 '22

While I will eventually test whether the other answer works more generally for more complicated NLTK packages, stopwords is really just a list (or a set of lists, if you need multiple languages) that you can cut and paste into your script:

>>> from nltk.corpus import stopwords
>>> stopwordlist = stopwords.words('english')
>>> print(stopwordlist)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

So I just defined it directly in my script, without importing anything:

stopwordlist = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
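
For example, filtering a token list then works exactly as it would with the imported version:

>>> words = ['this', 'is', 'a', 'simple', 'example', 'sentence']
>>> [w for w in words if w not in stopwordlist]
['simple', 'example', 'sentence']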
— Michael, answered Oct 08 '22