I built a crawler using the Python Scrapy library. It works perfectly and reliably when run locally. I have attempted to port it over to AWS Lambda (I have packaged it appropriately). However, when I run it, the process isn't blocked while the crawl runs; instead it completes before the crawlers can return, giving no results. These are the last lines I get in the logs before it exits:
2018-09-12 18:58:07 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-09-12 18:58:07 [scrapy.core.engine] INFO: Spider opened
Whereas normally I would get a whole lot of information about the pages being crawled. I've tried sleeping after starting the crawl, installing crochet and adding its decorators, and installing and using this specific framework that sounds like it addresses this problem, but it also doesn't work.
I'm sure this is an issue with Lambda not respecting Scrapy's blocking, but I have no idea how to address it.
I had the same problem and fixed it by creating empty modules for sqlite3, as described in this answer: https://stackoverflow.com/a/44532317/5441099. Apparently, Scrapy imports sqlite3 but doesn't necessarily use it. Python 3 expects sqlite3 to be on the host machine, but the AWS Lambda machines don't have it. The error message doesn't always show up in the logs.
This means you can make it work by switching to Python 2, or by creating empty modules for sqlite3 like I did.
My entry file for running the crawler is as follows, and it works on Lambda with Python 3.6:
# run_crawler.py
# crawl() is invoked from the handler function in Lambda
import os
from my_scraper.spiders.my_spider import MySpider
from scrapy.crawler import CrawlerProcess

# Start sqlite3 fix
import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")
# End sqlite3 fix


def crawl():
    process = CrawlerProcess(dict(
        FEED_FORMAT='json',
        FEED_URI='s3://my-bucket/my_scraper_feed/' +
                 '%(name)s-%(time)s.json',
        AWS_ACCESS_KEY_ID=os.getenv('AWS_ACCESS_KEY_ID'),
        AWS_SECRET_ACCESS_KEY=os.getenv('AWS_SECRET_ACCESS_KEY'),
    ))
    process.crawl(MySpider)
    process.start()  # the script will block here until all crawling jobs are finished


if __name__ == '__main__':
    crawl()
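The comment at the top says crawl() is invoked from the handler function in Lambda. For reference, a minimal sketch of what that handler might look like (the handler name and return value are my assumptions, not part of the original setup):

# lambda_function.py -- hypothetical handler that invokes crawl()
from run_crawler import crawl

def handler(event, context):
    crawl()  # blocks until the crawl finishes and the feed is written to S3
    return {'status': 'crawl finished'}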
While @viktorAndersen's answer solves the issue of Scrapy crashing or not working as expected in AWS Lambda, I had a heavy spider crawling 2000 URLs and I faced 2 problems:

1. A ReactorNotRestartable error when I ran the Scrapy function more than once. The first invocation worked fine, but from the second invocation onwards I ran into ReactorNotRestartable.

2. A timeout exception from crochet.wait_for() when the spider took longer than the expected duration.

This post is inspired by https://stackoverflow.com/a/57347964/12951298
import sys
import imp
from scrapy.crawler import CrawlerRunner
from scrapy.spiderloader import SpiderLoader
from scrapy.utils.project import get_project_settings
from scrapy.utils.log import configure_logging
from twisted.internet import reactor
from crochet import setup, wait_for

setup()

# sqlite3 fix (see the answer above)
sys.modules["sqlite"] = imp.new_module("sqlite")
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")


@wait_for(900)
def crawl():
    '''
    wait_for(timeout in seconds)
    Change the timeout accordingly.
    This function will raise crochet.TimeoutError if more than 900
    seconds elapse without an answer being received.
    '''
    spider_name = "header_spider"  # your spider name
    project_settings = get_project_settings()
    spider_loader = SpiderLoader(project_settings)
    spider_cls = spider_loader.load(spider_name)
    configure_logging()
    process = CrawlerRunner({**project_settings})
    d = process.crawl(spider_cls)
    return d


if __name__ == "__main__":
    crawl()
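For completeness, here is a minimal sketch of what the corresponding Lambda handler might look like, assuming the file above is saved as crochet_crawler.py (the module name, handler name, and return payloads are my assumptions, not part of the original answer):

# lambda_function.py -- hypothetical handler wrapping the crochet-based crawl()
import crochet
from crochet_crawler import crawl  # assumes the file above is named crochet_crawler.py

def handler(event, context):
    try:
        crawl()  # blocks for up to 900 seconds via @wait_for(900)
        return {'status': 'finished'}
    except crochet.TimeoutError:
        # raised by @wait_for when the spider exceeds the timeout
        return {'status': 'timed out'}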