 

Stream scrapy logging output to websocket

I am attempting to build an API that will run a Scrapy web spider when requested via a websocket message.

I would like to forward the logging output to the websocket client, so you can see what's going on in the (sometimes quite long-running) process. When finished, I will also send the scraped results.

As it is possible to run Scrapy in-process, I would like to do exactly that. I found a solution here that streams an external process's output to a websocket, but that doesn't seem right if it's possible to run Scrapy inside the server.

https://tomforb.es/displaying-a-processes-output-on-a-web-page-with-websockets-and-python

There are two ways I can imagine making this work in Twisted: somehow using a LogObserver, or defining a LogHandler (probably a StreamHandler with StringIO) and then handling the stream in Twisted with autobahn.websocket classes like WebSocketServerProtocol.
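The second idea can be sketched with nothing but the standard library: a custom logging handler that hands each formatted record to a send callback instead of writing to a file. This is a minimal illustration, not working websocket code - `WebSocketLogHandler` and `send_callback` are placeholder names, and the callback would be whatever send method your server protocol actually exposes:

```python
import logging

class WebSocketLogHandler(logging.Handler):
    """Hypothetical handler: forwards each formatted log record to a
    callback (e.g. a websocket send method) instead of a file."""
    def __init__(self, send_callback):
        logging.Handler.__init__(self)
        self.send = send_callback  # assumed: a callable taking one str

    def emit(self, record):
        try:
            self.send(self.format(record))
        except Exception:
            self.handleError(record)

# Demo: a plain list stands in for a connected websocket client
sent = []
handler = WebSocketLogHandler(sent.append)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

log = logging.getLogger("demo")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.info("spider started")

print(sent)  # → ['INFO spider started']
```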

Now I am quite stuck and don't know how to connect the ends.

Could someone please provide a short example of how to stream logging output from Twisted logging (avoiding a file if possible) to a websocket client?

asked Mar 14 '23 by Gregor Melhorn

1 Answer

I managed to solve this myself and wanted to let you know how I did it:

The basic idea was to have a process that gets called remotely and output a streaming log to a client, usually a browser.

Instead of building all the nasty details myself, I decided to go with autobahn.ws and crossbar.io, which provide pubsub and RPC via the WAMP protocol. That is essentially just JSON over websockets - exactly what I had planned to build, just way more advanced!

Here is a very basic example:

from twisted.internet.defer import inlineCallbacks

from autobahn.twisted.wamp import ApplicationSession
from example.spiders.basic_spider import BasicSpider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings

import logging

class PublishLogToSessionHandler(logging.Handler):
    """Logging handler that publishes each record to a WAMP pubsub channel."""
    def __init__(self, session, channel):
        logging.Handler.__init__(self)
        self.session = session
        self.channel = channel

    def emit(self, record):
        # Subscribers to the channel receive the message as a WAMP event
        self.session.publish(self.channel, record.getMessage())


class AppSession(ApplicationSession):

    # Keep Scrapy from installing its own root handler, so ours is used
    configure_logging(install_root_handler=False)

    @inlineCallbacks
    def onJoin(self, details):
        # From here on, every log record is published to the log channel
        logging.root.addHandler(PublishLogToSessionHandler(self, 'com.example.crawler.log'))

        # REGISTER a procedure for remote calling
        def crawl(domain):
            runner = CrawlerRunner(get_project_settings())
            runner.crawl("basic", domain=domain)
            return "Running..."

        yield self.register(crawl, 'com.example.crawler.crawl')
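To see the handler logic in isolation, without a running Crossbar router, the same class can be exercised against a stand-in for the WAMP session. `FakeSession` below is just a test double I'm inventing for illustration; it records what would have been published:

```python
import logging

class PublishLogToSessionHandler(logging.Handler):
    """Same handler as above: forwards log records to session.publish()."""
    def __init__(self, session, channel):
        logging.Handler.__init__(self)
        self.session = session
        self.channel = channel

    def emit(self, record):
        self.session.publish(self.channel, record.getMessage())

class FakeSession:
    """Test double: collects (channel, message) pairs instead of publishing."""
    def __init__(self):
        self.published = []

    def publish(self, channel, message):
        self.published.append((channel, message))

session = FakeSession()
logging.root.addHandler(PublishLogToSessionHandler(session, 'com.example.crawler.log'))
logging.root.setLevel(logging.INFO)
logging.info("Spider opened")

print(session.published)  # → [('com.example.crawler.log', 'Spider opened')]
```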
answered Mar 24 '23 by Gregor Melhorn