 

How to get Python Scrapy Crawler details?

Tags:

python

php

scrapy

I am using the Python Scrapy tool to extract data from websites. I'm launching Scrapy from my PHP code using proc_open(). Now I need to maintain a dashboard of sorts. Is there a way in Scrapy to get crawler details like:

  1. Time taken by Crawler to run.
  2. Start and Stop Time of crawler.
  3. Crawler Status (active or stopped).
  4. List of Crawlers running simultaneously.
asked Nov 24 '25 by kishan


1 Answer

Your problem can be solved by using a Scrapy extension.

For example:

from datetime import datetime

from scrapy import signals
from twisted.internet.task import LoopingCall


class SpiderDetails(object):
    """Extension for collect spider information like start/stop time."""

    update_interval = 5  # in seconds

    def __init__(self, crawler):
        # keep a reference to the crawler in case it is needed to access more information
        self.crawler = crawler
        # keep track of polling calls per spider
        self.pollers = {}

    @classmethod
    def from_crawler(cls, crawler):
        instance = cls(crawler)
        crawler.signals.connect(instance.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(instance.spider_closed, signal=signals.spider_closed)
        return instance

    def spider_opened(self, spider):
        now = datetime.utcnow()
        # store current timestamp in db as 'start time' for this spider
        # TODO: complete db calls

        # start activity poller
        poller = self.pollers[spider.name] = LoopingCall(self.spider_update, spider)
        poller.start(self.update_interval)

    def spider_closed(self, spider, reason):
        # store current timestamp in db as 'end time' for this spider
        # TODO: complete db calls

        # remove and stop activity poller
        poller = self.pollers.pop(spider.name)
        poller.stop()

    def spider_update(self, spider):
        now = datetime.utcnow()
        # update 'last update time' for this spider
        # TODO: complete db calls
        pass
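
The TODO db calls above are left for you to fill in. As a minimal sketch of what they could look like, assuming a hypothetical SQLite file spider_runs.db with a spider_runs table (the file name, table, columns and helper functions here are illustrative, not part of Scrapy):

import sqlite3
from contextlib import closing
from datetime import datetime

DB_PATH = 'spider_runs.db'  # hypothetical database file
TIME_FMT = '%Y-%m-%d %H:%M:%S'

SCHEMA = """CREATE TABLE IF NOT EXISTS spider_runs (
    id INTEGER PRIMARY KEY,
    spider TEXT,
    start_time TEXT,
    end_time TEXT,
    last_update TEXT
)"""


def _execute(query, params=()):
    # short-lived connection per call; the inner `conn` context manager commits
    with closing(sqlite3.connect(DB_PATH)) as conn, conn:
        conn.execute(SCHEMA)
        conn.execute(query, params)


def record_start(spider_name):
    # call from spider_opened: insert a new row with start and last-update times
    now = datetime.utcnow().strftime(TIME_FMT)
    _execute("INSERT INTO spider_runs (spider, start_time, last_update) VALUES (?, ?, ?)",
             (spider_name, now, now))


def record_update(spider_name):
    # call from spider_update: refresh the last-update time of the open run
    now = datetime.utcnow().strftime(TIME_FMT)
    _execute("UPDATE spider_runs SET last_update = ? WHERE spider = ? AND end_time IS NULL",
             (now, spider_name))


def record_end(spider_name):
    # call from spider_closed: stamp the end time of the open run
    now = datetime.utcnow().strftime(TIME_FMT)
    _execute("UPDATE spider_runs SET end_time = ? WHERE spider = ? AND end_time IS NULL",
             (now, spider_name))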

  1. Time taken by Crawler to run: that is end time - start time. You can calculate it when reading from the db, or compute and store it along with the end time (see the query sketch after this list).

  2. Start and Stop Time of crawler: that is stored in spider_opened and spider_closed methods.

  3. Crawler Status (Active or Stopped): your crawler is active if now - last update time is within the update interval (about 5 seconds). Otherwise, if the last update was a long time ago (30 seconds, 5 minutes or more), your spider has either stopped abnormally or hung. If the spider record has an end time, then the crawler finished correctly.

  4. List of Crawlers running simultaneously: your frontend can query for the records with an empty end time. Those spiders will be either running or dead (if the last update time was a long time ago).
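
Putting the pieces together, here is a hedged sketch of how a dashboard could derive those four items from the hypothetical spider_runs table of the previous sketch (the staleness threshold and column names are illustrative):

import sqlite3
from contextlib import closing
from datetime import datetime, timedelta

DB_PATH = 'spider_runs.db'            # same hypothetical database as above
TIME_FMT = '%Y-%m-%d %H:%M:%S'
STALE_AFTER = timedelta(seconds=30)   # illustrative "no longer active" threshold


def dashboard_rows():
    """Return one dict per run with start/end times, duration and a derived status."""
    with closing(sqlite3.connect(DB_PATH)) as conn:
        rows = conn.execute(
            "SELECT spider, start_time, end_time, last_update FROM spider_runs").fetchall()

    now = datetime.utcnow()
    result = []
    for spider, start, end, last_update in rows:
        start_dt = datetime.strptime(start, TIME_FMT)
        if end:
            # 1 & 2: duration is simply end time - start time
            status, duration = 'finished', datetime.strptime(end, TIME_FMT) - start_dt
        elif now - datetime.strptime(last_update, TIME_FMT) < STALE_AFTER:
            # 3: still being updated by the poller -> active
            status, duration = 'active', now - start_dt
        else:
            # no end time and no recent update -> stopped abnormally or hung
            status, duration = 'dead', None
        result.append({'spider': spider, 'start': start, 'end': end,
                       'status': status, 'duration': duration})
    return result


def running_spiders():
    # 4: crawlers currently running (no end time, recent update)
    return [row['spider'] for row in dashboard_rows() if row['status'] == 'active']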

Take into consideration that the spider_closed signal will not be called if the process finishes abruptly. You will need a cron job to clean up and/or update the dead records.
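
As a sketch of that cleanup, again against the hypothetical spider_runs table, a script run from cron could stamp an end time on runs whose last update is older than some threshold:

import sqlite3
from contextlib import closing
from datetime import datetime, timedelta

DB_PATH = 'spider_runs.db'  # same hypothetical database as above
TIME_FMT = '%Y-%m-%d %H:%M:%S'


def mark_dead_runs(stale_minutes=5):
    """Close out runs that never got an end time and have not updated recently."""
    cutoff = (datetime.utcnow() - timedelta(minutes=stale_minutes)).strftime(TIME_FMT)
    # string comparison works because TIME_FMT sorts lexicographically
    with closing(sqlite3.connect(DB_PATH)) as conn, conn:
        conn.execute(
            "UPDATE spider_runs SET end_time = last_update "
            "WHERE end_time IS NULL AND last_update < ?",
            (cutoff,))


if __name__ == '__main__':
    mark_dead_runs()

You could then schedule this script from your crontab every minute or so.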

Don't forget to add the extension to your settings.py file, like:

EXTENSIONS = {
    # SpiderDetails class is in the file mybot/extensions.py
    'mybot.extensions.SpiderDetails': 1000,
}
answered Nov 25 '25 by R. Max


