Closing database connection from pipeline and middleware in Scrapy

I have a Scrapy project that uses custom middleware and a custom pipeline to check and store entries in a Postgres DB. The middleware looks a bit like this:

class ExistingLinkCheckMiddleware(object):

    def __init__(self):
        # ... open connection to database
        pass

    def process_request(self, request, spider):
        # ... before each request, check in the DB
        # that the page hasn't been scraped before
        pass

The pipeline looks similar:

class MachinelearningPipeline(object):

    def __init__(self):
        # ... open connection to database
        pass

    def process_item(self, item, spider):
        # ... save the item to the database
        return item

It works fine, but I can't find a way to cleanly close these database connections when the spider finishes, which irks me.

Does anyone know how to do that?

Jamie Brown asked May 23 '13 10:05


1 Answer

I think the best way to do it is to connect a handler to Scrapy's spider_closed signal, e.g.:

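from scrapy import signals

class ExistingLinkCheckMiddleware(object):

    def __init__(self):
        # open connection to database
        pass

    @classmethod
    def from_crawler(cls, crawler):
        # scrapy.xlib.pydispatch was deprecated and later removed from Scrapy,
        # so connect the handler through the crawler's signal manager instead
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider, reason):
        # close db connection
        pass

    def process_request(self, request, spider):
        # before each request check in the DB
        # that the page hasn't been scraped before
        return None  # None lets Scrapy continue processing the request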
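
For the item pipeline there is a simpler option: Scrapy calls the optional open_spider and close_spider methods on item pipelines (not on downloader middlewares), so the connection can be opened and closed there without touching signals. Below is a minimal sketch of that idea. It makes some assumptions that are not in the question: sqlite3 from the standard library stands in for the Postgres driver so the example runs on its own, and the db_path parameter, the items table and the url/title fields are hypothetical names chosen for illustration.

```python
import sqlite3


class MachinelearningPipeline(object):
    """Sketch only: sqlite3 stands in for the real Postgres connection."""

    def __init__(self, db_path=":memory:"):
        # db_path is a hypothetical parameter; a Postgres DSN would go here.
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider is opened.
        self.connection = sqlite3.connect(self.db_path)
        # Hypothetical schema, created here so the sketch is self-contained.
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
        )

    def close_spider(self, spider):
        # Scrapy calls this once when the spider is closed.
        if self.connection is not None:
            self.connection.commit()
            self.connection.close()
            self.connection = None

    def process_item(self, item, spider):
        # Save the item; the url/title fields are assumed for illustration.
        self.connection.execute(
            "INSERT OR REPLACE INTO items (url, title) VALUES (?, ?)",
            (item["url"], item["title"]),
        )
        return item
```

With a real Postgres driver only the connect call and the SQL placeholders should need to change; the open_spider/close_spider pairing stays the same.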

See also:

  • scrapy: Call a function when a spider quits
  • Scrapy pipeline spider_opened and spider_closed not being called

Hope that helps.

alecxe answered Sep 25 '22 13:09