I have a Scrapy project that uses custom middleware and a custom pipeline to check and store entries in a Postgres DB. The middleware looks a bit like this:
class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        # ... open connection to database

    def process_request(self, request, spider):
        # ... before each request, check in the DB
        # that the page hasn't been scraped before
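Fleshed out a bit, it's along these lines (psycopg2 here; the scraped_pages table and its url column are just placeholders for my real schema):

import psycopg2
from scrapy.exceptions import IgnoreRequest


class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        # connection details are placeholders
        self.conn = psycopg2.connect("dbname=scraping user=scrapy")

    def process_request(self, request, spider):
        # drop requests whose URL is already recorded
        with self.conn.cursor() as cur:
            cur.execute("SELECT 1 FROM scraped_pages WHERE url = %s",
                        (request.url,))
            if cur.fetchone() is not None:
                raise IgnoreRequest("already scraped: %s" % request.url)
        # returning None lets the request continue as normal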
The pipeline looks similar:
class MachinelearningPipeline(object):
    def __init__(self):
        # ... open connection to database

    def process_item(self, item, spider):
        # ... save the item to the database
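And in the same spirit (again with a made-up items table and columns):

import psycopg2


class MachinelearningPipeline(object):
    def __init__(self):
        # connection details are placeholders
        self.conn = psycopg2.connect("dbname=scraping user=scrapy")

    def process_item(self, item, spider):
        # insert the scraped item; column names are illustrative
        with self.conn.cursor() as cur:
            cur.execute("INSERT INTO items (url, data) VALUES (%s, %s)",
                        (item.get("url"), item.get("data")))
        self.conn.commit()
        # pipelines must return the item so later pipelines see it
        return item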
It works fine, but I can't find a way to cleanly close these database connections when the spider finishes, which irks me.
Does anyone know how to do that?
I think the best way to do it is to use Scrapy's spider_closed signal, e.g.:
from scrapy import signals
from pydispatch import dispatcher  # scrapy.xlib.pydispatch was removed in Scrapy 2.0


class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        # ... open connection to database
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider, reason):
        # ... close db connection
        ...

    def process_request(self, request, spider):
        # before each request check in the DB
        # that the page hasn't been scraped before
        ...
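On newer Scrapy versions, where scrapy.xlib.pydispatch no longer exists, the documented way to hook up signals is the from_crawler classmethod. Here's a sketch of the same middleware using it:

from scrapy import signals


class ExistingLinkCheckMiddleware(object):
    def __init__(self):
        # ... open connection to database
        ...

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # ask Scrapy to call spider_closed when the spider finishes
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider, reason):
        # ... close db connection
        ...

The same pattern works in item pipelines, so your MachinelearningPipeline can close its connection the same way.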
See also: the Scrapy signals documentation (https://docs.scrapy.org/en/latest/topics/signals.html).
Hope that helps.