Scrapy on a schedule

Getting Scrapy to run on a schedule is driving me around the Twist(ed).

I thought the test code below would work, but I get a twisted.internet.error.ReactorNotRestartable error when the spider is triggered a second time:

from quotesbot.spiders.quotes import QuotesSpider
import schedule
import time
from scrapy.crawler import CrawlerProcess

def run_spider_script():
    process.crawl(QuotesSpider)
    process.start()


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_script)

while True:
    schedule.run_pending()
    time.sleep(1)

I'm guessing that CrawlerProcess tries to start the Twisted reactor again when that isn't required, and so the program crashes. Is there any way I can control this?

Also, at this stage, if there's an alternative way to automate a Scrapy spider to run on a schedule, I'm all ears. I tried scrapy.cmdline.execute, but couldn't get that to loop either:

from quotesbot.spiders.quotes import QuotesSpider
from scrapy import cmdline
import schedule
import time
from scrapy.crawler import CrawlerProcess


def run_spider_cmd():
    print("Running spider")
    cmdline.execute("scrapy crawl quotes".split())


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
})


schedule.every(5).seconds.do(run_spider_cmd)

while True:
    schedule.run_pending()
    time.sleep(1)

EDIT

Adding code that uses Twisted's task.LoopingCall() to run a test spider every few seconds. Am I going about this completely the wrong way, given that I want to schedule a spider that runs at the same time each day?

from twisted.internet import reactor
from twisted.internet import task
from scrapy.crawler import CrawlerRunner
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.xpath('//div[@class="quote"]')

        for quote in quotes:

            author = quote.xpath('.//small[@class="author"]/text()').extract_first()
            text = quote.xpath('.//span[@class="text"]/text()').extract_first()

            print(author, text)


def run_crawl():

    runner = CrawlerRunner()
    runner.crawl(QuotesSpider)


l = task.LoopingCall(run_crawl)
l.start(3)

reactor.run()

asked May 28 '17 15:05

2 Answers

You can use apscheduler:

pip install apscheduler

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from apscheduler.schedulers.twisted import TwistedScheduler

from Demo.spiders.baidu import YourSpider

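# CrawlerProcess starts and owns the Twisted reactor.
process = CrawlerProcess(get_project_settings())
# TwistedScheduler schedules its jobs on the Twisted reactor.
scheduler = TwistedScheduler()
scheduler.add_job(process.crawl, 'interval', args=[YourSpider], seconds=10)
scheduler.start()
# stop_after_crawl=False keeps the reactor alive between scheduled crawls.
process.start(stop_after_crawl=False)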

answered Oct 05 '22 20:10

First, there's usually only one Twisted reactor running, and it's not restartable (as you've discovered). Second, blocking calls (e.g. time.sleep(n)) should be avoided and replaced with async alternatives (e.g. twisted.internet.task.deferLater(reactor, n, ...)).
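Since the reactor can't be restarted inside one process, one workaround (if a blocking loop is acceptable and you are not already inside a Twisted application) is to give every crawl its own child process, so each run gets a fresh reactor. The following is only a rough sketch: run_in_fresh_process, run_every and CRAWL_COMMAND are made-up names for illustration, and it assumes the scrapy command is on your PATH and is launched from the project directory.

```python
import subprocess
import time

# Assumption: same command the question passes to cmdline.execute.
CRAWL_COMMAND = ["scrapy", "crawl", "quotes"]


def run_in_fresh_process(command, cwd=None):
    """Run `command` in a child process and return its exit code.

    Each child starts (and stops) its own Twisted reactor, so the
    parent never has to restart one.
    """
    completed = subprocess.run(command, cwd=cwd)
    return completed.returncode


def run_every(interval_seconds, command, max_runs=None):
    """Run `command` repeatedly, sleeping between runs.

    With max_runs=None this loops forever; otherwise it returns the
    list of exit codes after max_runs runs.
    """
    exit_codes = []
    while max_runs is None or len(exit_codes) < max_runs:
        if exit_codes:
            # Blocking sleep is fine here: no reactor lives in this process.
            time.sleep(interval_seconds)
        exit_codes.append(run_in_fresh_process(command))
    return exit_codes
```

Calling run_every(5, CRAWL_COMMAND) would then launch the crawl every 5 seconds after the previous one finishes. The trade-off is process start-up cost on every run, which is usually negligible for daily or hourly jobs.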

Using Scrapy effectively from a Twisted project requires the scrapy.crawler.CrawlerRunner core API as opposed to scrapy.crawler.CrawlerProcess. The main difference between the two is that CrawlerProcess runs Twisted's reactor for you (thus making it difficult to restart the reactor), whereas CrawlerRunner relies on the developer to start the reactor. Here's what your code could look like with CrawlerRunner:

from twisted.internet import reactor
from quotesbot.spiders.quotes import QuotesSpider
from scrapy.crawler import CrawlerRunner

def run_crawl():
    """
    Run a spider within Twisted. Once it completes,
    wait 5 seconds and run another spider.
    """
    runner = CrawlerRunner({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        })
    deferred = runner.crawl(QuotesSpider)
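    # you can use reactor.callLater or task.deferLater to schedule a function;
    # the lambda discards the crawl result, which addCallback would otherwise
    # pass to callLater as its first argument
    deferred.addCallback(lambda _: reactor.callLater(5, run_crawl))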
    return deferred

run_crawl()
reactor.run()   # you have to run the reactor yourself
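
To run at the same time each day rather than every 5 seconds, the fixed delay could be replaced by the number of seconds until the next wanted clock time. A minimal sketch, assuming local naive datetimes (so daylight-saving shifts are ignored); seconds_until is a hypothetical helper, not part of Scrapy or Twisted:

```python
from datetime import datetime, timedelta


def seconds_until(hour, minute, now=None):
    """Return the seconds from `now` until the next hour:minute.

    If that time has already passed today (or is exactly now),
    the result points at the same time tomorrow.
    """
    now = now or datetime.now()
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()
```

For example, passing seconds_until(6, 0) as the delay to reactor.callLater instead of 5 would schedule the next crawl for 06:00. Recomputing the delay after every crawl keeps the schedule from drifting by the crawl's own duration.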

answered Oct 05 '22 21:10

notorious.no

Tags: python, web-scraping, scrapy, twisted

Asked by itzafugazi. Answers by samuel161 and notorious.no.