I'm encountering problems with the NLTK package in AWS Lambda, though I believe the issue relates more to incorrect path configuration in Lambda. NLTK is having trouble finding data libraries that are stored locally rather than installed as part of the module. Many of the solutions listed on SO are simple path configs, as in the questions below, but I think this issue is specific to pathing in Lambda:
How to config nltk data directory from code?
What to download in order to make nltk.tokenize.word_tokenize work?
I should also mention that this relates to a previous question I posted here: Using NLTK corpora with AWS Lambda functions in Python
but the issue seems more general, so I have elected to redefine the question as being about how to correctly configure paths in Lambda to work with modules that require external data libraries, like NLTK. NLTK stores a lot of its data in a local nltk_data folder, but even with this folder included in the Lambda zip for upload, NLTK doesn't seem to find it.
Also included in the Lambda func zip file are the following files and dirs:
\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
\nltk_data\tokenizers\punkt\english.pickle
\nltk_data\tokenizers\punkt\PY3\english.pickle
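A quick diagnostic that shows whether NLTK can see any of these is to print its search path and ask for the resource directly; nltk.data.find() raises the same LookupError as in the log below when the resource is missing from every entry. A minimal sketch:

import nltk

# Every directory NLTK will search for data files, in order
print(nltk.data.path)

try:
    # find() joins the resource name onto each entry of nltk.data.path
    print(nltk.data.find('tokenizers/punkt/english.pickle'))
except LookupError as err:
    print(err)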
From the following site, it seems that /var/task/ is the folder in which the Lambda function executes, and I have tried including this path, to no avail: https://alestic.com/2014/11/aws-lambda-environment/
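For anyone wanting to confirm this, the execution directory can be checked from inside a handler; a trivial, untested sketch:

import os

def handler(event, context):
    # Report where the deployment package was unpacked and where we run from
    return {'cwd': os.getcwd(), 'file': os.path.abspath(__file__)}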
From the docs it also seems that there are a number of environment variables that can be used; however, I'm not sure how to use them from a Python script (coming from Windows, not Linux): http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html
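For reference, environment variables are read the same way on any platform, via os.environ. A minimal sketch of wiring them up for NLTK (untested in Lambda; LAMBDA_TASK_ROOT is set by the Lambda runtime itself, and my understanding is that NLTK only honours NLTK_DATA if it is set before nltk is first imported):

import os

# LAMBDA_TASK_ROOT points at the directory the deployment zip is unpacked
# into (normally /var/task); fall back to /var/task if it is absent
task_root = os.environ.get('LAMBDA_TASK_ROOT', '/var/task')

# NLTK reads NLTK_DATA when the module is first imported, so set it first
os.environ['NLTK_DATA'] = os.path.join(task_root, 'nltk_data')

import nltk  # nltk_data bundled in the zip should now be on nltk.data.path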
Hoping to throw this up here in case anyone has experience configuring Lambda paths. I haven't seen many questions relating to this specific issue despite searching, so hoping a resolution here could be useful.
Code is here:
import nltk
import pymysql.cursors
import re
import rds_config
import logging
from boto_conn import botoConn
from warnings import filterwarnings
from nltk import word_tokenize
nltk.data.path.append("/nltk_data/tokenizers/punkt")
nltk.data.path.append("/nltk_data/taggers/averaged_perceptron_tagger")
logger = logging.getLogger()
logger.setLevel(logging.INFO)
rds_host = "nodexrd2.cw7jbiq3uokf.ap-southeast-2.rds.amazonaws.com"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name
filterwarnings("ignore", category=pymysql.Warning)
def parse():
    tknzr = word_tokenize
    # NLTK's English stopword list, hardcoded to avoid a corpus download
    stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself',
                 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its',
                 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this',
                 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has',
                 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or',
                 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between',
                 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down',
                 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there',
                 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
                 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't',
                 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',
                 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn',
                 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
    s3file = botoConn(None, 1).getvalue()
    db = pymysql.connect(host=rds_host, user=name, passwd=password, db=db_name, connect_timeout=5,
                         charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    lines = s3file.split('\n')
    for line in lines:
        tkn = tknzr(line)
        tagged = nltk.pos_tag(tkn)
        excl = ['the', 'and', 'of', 'at', 'what', 'to', 'it', 'a', 'of', 'i', 's', 't', 'is', 'I\'m', 'Im', 'U', 'RT', 'RTs', 'its']
        # Drop stopwords, Twitter noise, single characters and links, keep only
        # proper nouns, verbs and nouns, then strip non-alphanumeric characters
        x = [i for i in tagged if i[0] not in stopwords]
        x = [i for i in x if i[0] not in excl]
        x = [i for i in x if len(i[0]) > 1]
        x = [i for i in x if 'https' not in i[0]]
        x = [i for i in x if i[1] == 'NNP' or i[1] == 'VB' or i[1] == 'NN']
        x = [re.sub(r'[^A-Za-z0-9]+', '', i[0]) for i in x]
        sql_dat_a, sql_dat = [], []
Output log is here:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/sbx_user1067/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/nltk_data/tokenizers/punkt'
- '/nltk_data/taggers/averaged_perceptron_tagger'
- u''
**********************************************************************: LookupError
Traceback (most recent call last):
File "/var/task/Tweetscrape_Timer.py", line 27, in schedule
server()
File "/var/task/Tweetscrape_Timer.py", line 14, in server
parse()
File "/var/task/parse_to_SQL.py", line 91, in parse
tkn = tknzr(line)
File "/var/task/nltk/tokenize/__init__.py", line 109, in word_tokenize
return [token for sent in sent_tokenize(text, language)
File "/var/task/nltk/tokenize/__init__.py", line 93, in sent_tokenize
tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
File "/var/task/nltk/data.py", line 808, in load
opened_resource = _open(resource_url)
File "/var/task/nltk/data.py", line 926, in _open
return find(path_, path + ['']).open()
File "/var/task/nltk/data.py", line 648, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found. Please
use the NLTK Downloader to obtain the resource: >>>
nltk.download()
Searched in:
- '/home/sbx_user1067/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- '/nltk_data/tokenizers/punkt'
- '/nltk_data/taggers/averaged_perceptron_tagger'
- u''
**********************************************************************
Seems your current Python code runs from /var/task. I would suggest trying (haven't tried myself):
nltk.data.path.append("/var/task/nltk_data")