I am attempting to build an API that will run a Scrapy web spider when requested via a websocket message.
I would like to forward the logging output to the websocket client so you see what's going on in the - sometimes quite long-running - process. When finished, I will also send the scraped results.
As it is possible to run Scrapy in-process, I would like to do exactly that. I found a solution that will stream an external process to a websocket here, but that doesn't seem right if it's possible to run Scrapy inside the server.
https://tomforb.es/displaying-a-processes-output-on-a-web-page-with-websockets-and-python
There are two ways I can imagine to make this work in Twisted: Somehow using a LogObserver, or defining a LogHandler (probably StreamHandler with StringIO) and then handle the Stream in some way in Twisted with autobahn.websocket classes like WebSocketServerProtocol.
Now I am quite stuck and don't know how to connect the ends.
Could someone please provide a short example how to stream logging output from twisted logging (avoiding a file if possible) to a websocket client?
I managed to solve this by myself somehow and wanted to let you know how I did it:
The basic idea was to have a process that gets called remotely and output a streaming log to a client, usually a browser.
Instead of building all the nasty details myself, I decided to go with autobahn.ws and crossbar.io, providing pubsub and rpc via the Wamp protocol which is essentially just JSON on websockets - exactly what I had planned to build, just way more advanced!
Here is a very basic example:
from twisted.internet.defer import inlineCallbacks
from autobahn.twisted.wamp import ApplicationSession
from example.spiders.basic_spider import BasicSpider
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
import logging
class PublishLogToSessionHandler(logging.Handler):
def __init__(self, session, channel):
logging.Handler.__init__(self)
self.session = session
self.channel = channel
def emit(self, record):
self.session.publish(self.channel, record.getMessage())
class AppSession(ApplicationSession):
configure_logging(install_root_handler=False)
@inlineCallbacks
def onJoin(self, details):
logging.root.addHandler(PublishLogToSessionHandler(self, 'com.example.crawler.log'))
# REGISTER a procedure for remote calling
def crawl(domain):
runner = CrawlerRunner(get_project_settings())
runner.crawl("basic", domain=domain)
return "Running..."
yield self.register(crawl, 'com.example.crawler.crawl')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With