Using NLTK corpora with AWS Lambda functions in Python

Tags: python, aws-lambda, nltk

I'm having trouble using NLTK corpora (in particular the stop words) in AWS Lambda. I know the corpora need to be downloaded, so I ran nltk.download('stopwords') and included the result in the zip file used to upload the Lambda modules, under nltk_data/corpora/stopwords.

The usage in the code is as follows:

from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")

This returns the following error in the Lambda log output:

module initialization error: 
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/sbx_user1062/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data'
**********************************************************************

I have also tried to load the data directly by including:

nltk.data.load("/nltk_data/corpora/stopwords/english")

which yields a different error:

module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.

It's possible that Lambda has a problem loading the data from the zip and needs it stored externally, say on S3, but that seems a bit strange.

Any idea what format the

Does anyone know where I could be going wrong?

asked Feb 22 '17 04:02

Praxis

1 Answer

Another solution is to use Lambda's ephemeral storage, located at /tmp.

So, you would have something like this:

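import json
import nltk
from nltk.tokenize import word_tokenize

# /tmp is writable inside the Lambda execution environment
nltk.data.path.append("/tmp")

# Fetch the punkt tokenizer models into /tmp during module initialization
nltk.download("punkt", download_dir="/tmp")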

At runtime, punkt is downloaded to the /tmp directory, which is writable. However, this likely isn't a great solution under high concurrency, because every new execution environment (cold start) has to repeat the download.
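One way to soften that cost is to download only when the resource is not already present, since /tmp persists while an execution environment stays warm. The following is a minimal sketch rather than a tested Lambda deployment: the helper name ensure_nltk_resource and the /tmp/nltk_data subdirectory are illustrative assumptions and do not come from the answer above. It relies on nltk.data.find raising LookupError when a resource is missing.

```python
def ensure_nltk_resource(resource, package, download_dir="/tmp/nltk_data"):
    """Download an NLTK package into download_dir only if it is not found.

    resource: lookup path such as "corpora/stopwords"
    package:  downloader id such as "stopwords"
    Returns True if a download was triggered, False otherwise.
    (Hypothetical helper; the name and default directory are assumptions.)
    """
    # Imported here so this module can be loaded where NLTK is not installed
    # (for example in unit tests that substitute a stand-in module).
    import nltk

    # Register the writable directory so NLTK searches it on lookup.
    if download_dir not in nltk.data.path:
        nltk.data.path.append(download_dir)

    try:
        # Searches every directory in nltk.data.path, including bundled data.
        nltk.data.find(resource)
        return False
    except LookupError:
        # Not found anywhere: fetch it into the writable directory.
        nltk.download(package, download_dir=download_dir, quiet=True)
        return True
```

Calling ensure_nltk_resource("tokenizers/punkt", "punkt") at module level would then skip the download on warm invocations. A fresh execution environment still pays for one download, so bundling the data with the deployment package remains the option that avoids network access entirely.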

answered Sep 21 '22 12:09

Anonymous Juan
