Using NLTK corpora with AWS Lambda functions in Python

I'm having trouble using NLTK corpora (in particular the stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded, so I ran nltk.download('stopwords') and included the result in the zip file used to upload the Lambda modules, under nltk_data/corpora/stopwords.

The usage in the code is as follows:

from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")

This produces the following error in the Lambda log output:

module initialization error: 
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/sbx_user1062/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data'
**********************************************************************

I have also tried to load the data directly by including

nltk.data.load("/nltk_data/corpora/stopwords/english")

which yields a different error:

module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.

It's possible that there is a problem loading the data from the Lambda zip and that it needs to be stored externally, say on S3, but that seems a bit strange.
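One thing worth checking before moving the data elsewhere: Lambda unpacks the deployment zip under the directory named by the LAMBDA_TASK_ROOT environment variable, so "/nltk_data" refers to the filesystem root rather than to the folder shipped in the zip. The following is a minimal sketch, assuming the data sits in an nltk_data folder at the top level of the package; the helper names bundled_nltk_data_dir and load_english_stopwords are made up for illustration.

```python
import os


def bundled_nltk_data_dir(subdir="nltk_data"):
    """Return the absolute path of a data folder shipped inside the deployment zip.

    Lambda sets LAMBDA_TASK_ROOT to the directory where the package is
    unpacked; outside Lambda this falls back to the current directory.
    """
    task_root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    return os.path.join(task_root, subdir)


def load_english_stopwords():
    """Register the bundled folder with NLTK, then read the corpus."""
    # Imported here so the path helper above works without NLTK installed.
    import nltk
    from nltk.corpus import stopwords

    data_dir = bundled_nltk_data_dir()
    if data_dir not in nltk.data.path:
        # Register the folder before the first corpus access,
        # because that access is when NLTK searches its paths.
        nltk.data.path.append(data_dir)
    return stopwords.words("english")
```

Calling load_english_stopwords() at module level would then replace the three lines in the question. Whether this resolves the error depends on the zip layout actually matching the assumed nltk_data/corpora/stopwords structure.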

Any idea what format the

Does anyone know where I could be going wrong?

Praxis asked Feb 22 '17 04:02


1 Answer

Another solution is to use Lambda's ephemeral storage, which is available at /tmp.

So, you would have something like this:

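import nltk
import json
from nltk.tokenize import word_tokenize

# Make NLTK search /tmp, which is writable inside Lambda.
nltk.data.path.append("/tmp")

# Fetch the punkt tokenizer data into /tmp when the module is first loaded.
nltk.download("punkt", download_dir="/tmp")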

At runtime, punkt is downloaded to the /tmp directory, which is writable. However, this likely isn't a great solution under high concurrency, because every new execution environment has its own /tmp and has to download the data again.
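To limit repeated downloads, the call can be guarded so that it only runs when the resource is missing. This is a minimal sketch, not a tested Lambda deployment: the function name ensure_nltk_resource is hypothetical, and the find and download parameters exist only so the logic can be exercised without network access (they default to nltk.data.find and nltk.download).

```python
def ensure_nltk_resource(resource, package, download_dir="/tmp",
                         find=None, download=None):
    """Download `package` into `download_dir` only if `resource` is missing.

    Returns True when a download was triggered, False when the resource
    was already found on the NLTK search path.
    """
    if find is None or download is None:
        import nltk
        if download_dir not in nltk.data.path:
            nltk.data.path.append(download_dir)
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is absent.
        find(resource)
        return False
    except LookupError:
        download(package, download_dir=download_dir, quiet=True)
        return True
```

At module level this would be called as ensure_nltk_resource("tokenizers/punkt", "punkt"). An execution environment that is reused may still have the files in /tmp and skip the download, though that reuse is not guaranteed and a cold start still pays the full cost.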

Anonymous Juan answered Sep 21 '22 12:09