
Lambda not supporting NLTK file size

I am writing a Python script that analyzes a piece of text and returns the data in JSON format. I am using NLTK to analyze the text. This is my flow:

Create an endpoint (API Gateway) -> it calls my Lambda function -> the function returns the required data as JSON.

I wrote my script and deployed it to Lambda, but I ran into this issue:

Resource punkt not found. Please use the NLTK Downloader to obtain the resource:

    >>> import nltk
    >>> nltk.download('punkt')

Searched in:
  - '/home/sbx_user1058/nltk_data'
  - '/usr/share/nltk_data'
  - '/usr/local/share/nltk_data'
  - '/usr/lib/nltk_data'
  - '/usr/local/lib/nltk_data'
  - '/var/lang/nltk_data'
  - '/var/lang/lib/nltk_data'

Even after downloading 'punkt', my script still gave me the same error. I tried the solutions here:

Optimizing python script extracting and processing large data files

but the problem is that the nltk_data folder is huge, while Lambda has a size restriction.

How can I fix this issue? Or where else could I host my script and still call it through an API?

I am using Serverless to deploy my Python scripts.

noor asked Oct 20 '17 09:10



1 Answer

There are two things that you can do:

  1. The error suggests that the data path is not being set properly. Try adding the folder to NLTK's search path:

     import nltk
     nltk.data.path.append('/var/task/nltk_data/')

     or set the path through an environment variable:

    1. After running nltk.download(), copy the downloaded data folder to the root folder of your AWS Lambda application. (The directory must be named "nltk_data".)

    2. In the Lambda function dashboard (in the AWS console), add NLTK_DATA=./nltk_data as a key-value environment variable.
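
A minimal sketch of the same idea done from the handler code instead of the console, assuming the nltk_data directory is packaged in the root of the deployment and that nltk itself is included in the package. NLTK reads NLTK_DATA when it is first imported, so the variable is set before the import. The names configure_nltk_data and handler, and the "text" key on the event, are illustrative and not part of the original question.

```python
import os


def configure_nltk_data(base_dir=None):
    """Point NLTK at an nltk_data folder bundled with the function code.

    Assumption: the folder is named "nltk_data" and sits in the package root.
    On Lambda, LAMBDA_TASK_ROOT holds that root (normally /var/task).
    """
    root = base_dir or os.environ.get("LAMBDA_TASK_ROOT") or os.getcwd()
    data_dir = os.path.join(root, "nltk_data")
    os.environ["NLTK_DATA"] = data_dir
    return data_dir


def handler(event, context):
    data_dir = configure_nltk_data()

    # Imported here so that NLTK_DATA is already set when NLTK builds its search path.
    import nltk

    # Covers the case where nltk was already imported elsewhere in the process.
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)

    # Hypothetical payload shape: the text to analyze arrives under "text".
    tokens = nltk.word_tokenize(event.get("text", ""))
    return {"tokens": tokens}
```

If the console environment variable is already set, the handler does not need any of this; the sketch is only useful when you prefer to keep the path next to the code.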


  2. Reduce the size of the NLTK downloads, since you won't need all of them.

    1. Delete all the zip files and keep only the parts you need. For example, if you only need stopwords, keep nltk_data/corpora/stopwords and delete the rest.

    2. Or, if you need tokenizers, keep nltk_data/tokenizers/punkt. Most of these can be downloaded separately, e.g. with python -m nltk.downloader punkt; then copy over the files.
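
The trimming step can also be scripted rather than done by hand. The following is a rough sketch that copies only the listed resource folders out of a full nltk_data tree into a slim one for packaging; the function name copy_needed_resources and the example paths are assumptions for illustration. It needs Python 3.8 or later because of dirs_exist_ok.

```python
import shutil
from pathlib import Path


def copy_needed_resources(full_dir, slim_dir, needed):
    """Copy only the listed resources from a full nltk_data tree.

    `needed` holds paths relative to nltk_data, e.g. "tokenizers/punkt".
    Only the unpacked folders are copied, so the .zip archives stay behind.
    Returns the list of resources that were copied.
    """
    full_dir, slim_dir = Path(full_dir), Path(slim_dir)
    copied = []
    for rel in needed:
        src = full_dir / rel
        if not src.is_dir():
            raise FileNotFoundError(f"{src} not found; download it first")
        dst = slim_dir / rel
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(rel)
    return copied
```

For example, copy_needed_resources(Path.home() / "nltk_data", "nltk_data", ["tokenizers/punkt", "corpora/stopwords"]) would leave a small nltk_data folder in the current directory, ready to be deployed with the function.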

0bserver07 answered Sep 23 '22 07:09