
Python Scrapy function to be called just before spider_closed signal sent?

I wrote a spider using Scrapy that makes a whole bunch of HtmlXPathSelector requests to separate sites. It writes a row of data to a .csv file as soon as each request is (asynchronously) satisfied. It's impossible to tell which request will be satisfied last, because a request is repeated if no data was extracted (occasionally it misses the data a few times). Even though I start with a neat list, the output is jumbled because the rows are written immediately after the data is extracted.

Now I'd like to sort that list by one column, but only after every request is done. Can the 'spider_closed' signal be used to trigger a real function? As shown below, I tried connecting the signal with dispatcher, but the connected function only seems able to print things, rather than work with variables or call other functions.

def start_requests(self)
    ...  dispatcher.connect(self.spider_closed, signal=signals.engine_stopped) ....


def spider_closed(spider):
    print 'this gets printed alright'   # <-only if the next line is omitted...
    out = self.AnotherFunction(in)      # <-This doesn't seem to run
corg asked Nov 13 '22 19:11


1 Answer

I hacked together a pipeline to solve this problem for you.

file: Project.middleware_module.SortedCSVPipeline

import csv
from scrapy import signals


class SortedCSVPipeline(object):

    def __init__(self):
        self.items = []
        self.file_name = r'YOUR_FILE_PATH_HERE'
        self.key = 'YOUR_KEY_HERE'

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_closed(self, spider):
        for item in sorted(self.items, key=lambda k: k[self.key]):
            self.write_to_csv(item)

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def write_to_csv(self, item):
        # Append one row; the with-block closes the file after each write.
        with open(self.file_name, 'a') as csv_file:
            writer = csv.writer(csv_file, lineterminator='\n')
            writer.writerow([item[key] for key in item.keys()])
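
The buffer, sort, then write idea behind this pipeline can be tried without Scrapy. The following is only a minimal sketch in Python 3 using the standard library: the function name write_sorted_rows, the sample rows, and the explicit column list are all made up for illustration and are not part of the pipeline above. Passing a fixed column list is an assumption that keeps the columns in the same order for every row.

```python
import csv
import io


def write_sorted_rows(rows, sort_key, columns, out_file):
    """Write buffered rows to out_file as CSV, ordered by sort_key.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    writer = csv.DictWriter(out_file, fieldnames=columns, lineterminator='\n')
    writer.writeheader()
    # Sort once, after every row has been collected.
    for row in sorted(rows, key=lambda r: r[sort_key]):
        writer.writerow(row)


# Rows arrive in whatever order the requests happened to finish.
buffered = [
    {'site': 'c.example', 'price': 30},
    {'site': 'a.example', 'price': 10},
    {'site': 'b.example', 'price': 20},
]

# An in-memory file stands in for the real CSV file.
out = io.StringIO()
write_sorted_rows(buffered, 'site', ['site', 'price'], out)
csv_text = out.getvalue()
print(csv_text)
```

In the pipeline, process_item plays the role of filling the buffer and spider_closed plays the role of the final sorted write.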

file: settings.py

ITEM_PIPELINES = {"Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline" : 1000}

With this pipeline in place you no longer need an item exporter, because the pipeline does the CSV writing for you. Also, the 1000 in the pipeline entry in your settings needs to be a higher value than that of every other pipeline you want to run before this one. I tested this in my project and it produced a CSV file sorted by the column I specified! HTH
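
To illustrate the ordering, here is a small sketch of how the numbers in ITEM_PIPELINES determine the sequence: Scrapy runs pipelines from the lowest value to the highest. The two extra pipeline paths (CleanFieldsPipeline and DropDuplicatesPipeline) are hypothetical placeholders, and the sorting below only mimics the ordering rule; it is not Scrapy's own code.

```python
# Hypothetical pipeline paths; only the relative order of the numbers matters.
ITEM_PIPELINES = {
    'Project.pipelines.CleanFieldsPipeline': 300,
    'Project.pipelines.DropDuplicatesPipeline': 500,
    'Project.middleware_module.SortedCSVPipeline.SortedCSVPipeline': 1000,
}

# Items pass through the pipelines from the lowest value to the highest,
# so the sorted CSV pipeline sees each item last.
run_order = [
    path
    for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])
]

for position, path in enumerate(run_order, start=1):
    print(position, path)
```

Because the sorted CSV pipeline has the highest number, it only buffers items that the earlier pipelines have already cleaned up and let through.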

Cheers

rocktheartsm4l answered Nov 15 '22 12:11