I built a crawler using the Python Scrapy library. It works perfectly and reliably when run locally. I have attempted to port it over to AWS Lambda (I have packaged it appropriately). However, when I run it, the process isn't blocked while the crawl runs; instead it completes before the crawlers can return, giving no results. These are the last lines I get in the logs before it exits:
2018-09-12 18:58:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-12 18:58:07 [scrapy.core.engine] INFO: Spider opened
Whereas normally I would get a whole lot of information about the pages being crawled. I've tried sleeping after starting the crawl, installing crochet and adding its decorators, and installing and using this specific framework that sounds like it addresses this problem, but it also doesn't work.
I'm sure this is an issue with Lambda not respecting Scrapy's blocking, but I have no idea how to address it.
I had the same problem and fixed it by creating empty modules for sqlite3, as described in this answer: https://stackoverflow.com/a/44532317/5441099. Apparently, Scrapy imports sqlite3 but doesn't necessarily use it. Python 3 expects sqlite3 to be on the host machine, but the AWS Lambda machines don't have it. The error message doesn't always show up in the logs.
This means you can make it work by switching to Python 2, or by creating empty modules for sqlite3 like I did.
My entry file for running the crawler is as follows, and it works on Lambda with Python 3.6:
# run_crawler.py
# crawl() is invoked from the handler function in Lambda
import os
from my_scraper.spiders.my_spider import MySpider
from scrapy.crawler import CrawlerProcess

# Start sqlite3 fix
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix


def crawl():
    process = CrawlerProcess(dict(
        FEED_FORMAT='json',
        FEED_URI='s3://my-bucket/my_scraper_feed/' +
                 '%(name)s-%(time)s.json',
        AWS_ACCESS_KEY_ID=os.getenv('AWS_ACCESS_KEY_ID'),
        AWS_SECRET_ACCESS_KEY=os.getenv('AWS_SECRET_ACCESS_KEY'),
    ))
    process.crawl(MySpider)
    process.start()  # the script will block here until all crawling jobs are finished


if __name__ == '__main__':
    crawl()
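The comment at the top says crawl() is invoked from the handler function in Lambda. For reference, a minimal sketch of what that handler might look like (the handler name and return value are my assumptions, not part of the original setup):

# lambda_function.py -- hypothetical handler that invokes crawl()
from run_crawler import crawl

def handler(event, context):
    crawl()  # blocks until the crawl finishes and the feed is written to S3
    return {'status': 'crawl finished'}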
While @viktorAndersen's answer solves the issue of Scrapy crashing or not working as expected in AWS Lambda, I had a heavy spider crawling 2000 URLs and I faced 2 problems:

1. A ReactorNotRestartable error when I ran the Scrapy function more than once. The first invocation worked fine, but from the second invocation onwards I ran into ReactorNotRestartable.

2. A timeout exception from crochet.wait_for() when the spider took longer than the expected duration.

This post is inspired by https://stackoverflow.com/a/57347964/12951298
import sys
import imp
from scrapy.crawler import CrawlerRunner
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from crochet import setup, wait_for

setup()

# sqlite3 fix (see the answer above)
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")


@wait_for(900)
def crawl():
    '''
    wait_for(timeout in seconds)
    Change the timeout accordingly.
    This function will raise crochet.TimeoutError if more than 900
    seconds elapse without an answer being received.
    '''
    spider_name = "header_spider"  # your spider name
    project_settings = get_project_settings()
    spider_loader = SpiderLoader(project_settings)
    spider_cls = spider_loader.load(spider_name)
    configure_logging()
    process = CrawlerRunner({**project_settings})
    d = process.crawl(spider_cls)
    return d


if __name__ == "__main__":
    crawl()
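For completeness, here is a minimal sketch of what the corresponding Lambda handler might look like, assuming the file above is saved as crochet_crawler.py (the module name, handler name, and return payloads are my assumptions, not part of the original answer):

# lambda_function.py -- hypothetical handler wrapping the crochet-based crawl()
import crochet
from crochet_crawler import crawl  # assumes the file above is named crochet_crawler.py

def handler(event, context):
    try:
        crawl()  # blocks for up to 900 seconds via @wait_for(900)
        return {'status': 'finished'}
    except crochet.TimeoutError:
        # raised by @wait_for when the spider exceeds the timeout
        return {'status': 'timed out'}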