I'm building an app that uses Flask and Scrapy. When the root URL of my app is accessed, it processes some data and displays it. In addition, I also want to (re)start my spider if it is not already running. Since my spider takes about 1.5 hours to finish, I run it in a background thread using threading. Here is a minimal example (you'll also need testspiders):
import os
from flask import Flask, render_template
import threading
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from testspiders.spiders.followall import FollowAllSpider
def crawl():
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    reactor.run()
app = Flask(__name__)
@app.route('/')
def main():
    run_in_bg = threading.Thread(target=crawl, name='crawler')
    thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
    if 'crawler' not in thread_names:
        run_in_bg.start()
    return 'hello world'
if __name__ == "__main__":
    port = int(os.environ.get('PORT', 5000))
    app.run(host='0.0.0.0', port=port)
As a side note, the following lines are my ad hoc approach to checking whether my crawler thread is still running. If there's a more idiomatic approach, I'd appreciate some guidance.
run_in_bg = threading.Thread(target=crawl, name='crawler')
thread_names = [t.name for t in threading.enumerate() if isinstance(t, threading.Thread)]
if 'crawler' not in thread_names:
    run_in_bg.start()
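One alternative I've considered is keeping a single module-level Thread handle and checking is_alive() instead of scanning thread names; a minimal sketch (the crawler_thread and start_crawl_if_idle names are mine, and crawl() is the function above):

crawler_thread = None  # single module-level handle to the crawler thread

def start_crawl_if_idle():
    global crawler_thread
    # is_alive() is False both before start() is called and after the
    # thread finishes, so this covers "never started" and "done".
    if crawler_thread is None or not crawler_thread.is_alive():
        crawler_thread = threading.Thread(target=crawl, name='crawler')
        crawler_thread.start()

This still races if two requests arrive at once; wrapping the check in a threading.Lock would close that gap.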
Moving on to the problem: if I save the above script as crawler.py, run python crawler.py, and access localhost:5000, then I get the following error (ignore scrapy's HtmlXPathSelector deprecation errors):
exceptions.ValueError: signal only works in main thread
Although the spider runs, it doesn't stop, because the signals.spider_closed signal only works in the main thread (according to this error). As expected, subsequent requests to the root URL result in copious errors.
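From what I can tell, the traceback comes from the reactor trying to install OS signal handlers outside the main thread; Twisted's reactor.run() accepts installSignalHandlers=False to skip that step. A sketch of the one-line change to crawl() above, though I'm not sure what else it affects:

def crawl():
    spider = FollowAllSpider(domain='scrapinghub.com')
    crawler = Crawler(Settings())
    crawler.configure()
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()
    log.start()
    # signal.signal() only works in the main thread, so skip installing
    # SIGINT/SIGTERM handlers when the reactor runs in a worker thread.
    reactor.run(installSignalHandlers=False)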
How can I design my app to start my spider if it is not already crawling, while at the same time returning control to my app immediately (i.e. without waiting for the crawler to finish) so it can get on with other work?
It's not the best idea to have Flask start long-running threads like this.
I would recommend using a task queue such as Celery, with a broker like RabbitMQ. Your Flask application can put the work it wants done in the background on the queue and then return immediately.
You can then have workers outside of your main app process those tasks and do all of your scraping, as sketched below.
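A minimal sketch of that split, assuming Celery with a RabbitMQ broker on localhost; the broker URL and the tasks/run_crawl names are illustrative, and crawl() is the function from the question:

# tasks.py -- worker side; start it with: celery -A tasks worker
from celery import Celery

celery_app = Celery('tasks', broker='amqp://guest@localhost//')

@celery_app.task
def run_crawl():
    # Import here, not at module top, so this module and the Flask
    # module don't import each other at load time.
    from crawler import crawl
    # The crawl runs in a worker process, where the Twisted reactor can
    # own the main thread and its signal handling works normally. A
    # reactor cannot be restarted within one process, so running the
    # worker with --max-tasks-per-child=1 gives each crawl a fresh one.
    crawl()

# crawler.py (route) -- the Flask side only enqueues work
from flask import Flask
from tasks import run_crawl

app = Flask(__name__)

@app.route('/')
def main():
    run_crawl.delay()  # returns as soon as the message is queued
    return 'hello world'

Note that Celery does not deduplicate tasks on its own, so the "only start if not already crawling" requirement still needs its own guard (e.g. a lock or a flag stored somewhere the workers share).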