
How to run a spider multiple times with different input

I'm trying to scrape information from different sites about some products. Here is the structure of my program:

product_list = ['iPad', 'iPhone', 'AirPods', ...]

def spider_tmall(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...

def spider_jd(self):
    self.driver.find_element_by_id('searchKeywords').send_keys(product_list[a])
    # ...

if __name__ == '__main__':

    for a in range(len(product_list)):
        process = CrawlerProcess(settings={
            "FEEDS": {
                "itemtmall.csv": {"format": "csv",
                                  "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall']},
                "itemjd.csv": {"format": "csv",
                               "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd']},
            },
        })

        process.crawl(tmallSpider)
        process.crawl(jdSpider)
        process.start()

Basically, I want to run all spiders for all inputs in product_list. Right now, my program runs through all spiders only once (in this case, it does the job for iPad), then raises a ReactorNotRestartable error and terminates. Does anybody know how to fix it? Also, my overall goal is to run the spiders multiple times; the input doesn't necessarily have to be a list. It could be a CSV file or something else. Any suggestion would be appreciated!

Tianhe Xie asked Jul 23 '20 09:07

2 Answers

When you call process.start(), Scrapy's CrawlerProcess starts a Twisted reactor that by default stops when the crawlers finish, and the reactor is not meant to be restarted. One possible solution is to call start() with the stop_after_crawl parameter set to False:

 process.start(stop_after_crawl=False)

This prevents the reactor from stopping, bypassing the restart problem. I can't say it won't lead to other problems down the line, though, so you should test it to be sure.

In the documentation there is also an example of running multiple spiders in the same process, one of which explicitly runs and stops the reactor, but it uses CrawlerRunner instead of CrawlerProcess.
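As a rough sketch of that approach (adapted from the Scrapy docs' "running multiple spiders in the same process" example, and assuming the tmallSpider and jdSpider classes from the question), the crawls can be chained sequentially and the reactor stopped by hand:

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # yield waits for each crawl to finish before starting the next one
    yield runner.crawl(tmallSpider)
    yield runner.crawl(jdSpider)
    reactor.stop()  # stop the reactor once all crawls are done

crawl()
reactor.run()  # the script blocks here until the last crawl finishes
```

Because you run and stop the reactor yourself here, you sidestep CrawlerProcess's one-shot start() entirely.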

Finally, if the solutions above don't help, I would suggest trying this:

if __name__ == '__main__':

    process = CrawlerProcess(settings={
        "FEEDS": {
            "itemtmall.csv": {"format": "csv",
                              "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall']},
            "itemjd.csv": {"format": "csv",
                           "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd']},
        },
    })
    for a in range(len(product_list)):
        # crawl() keyword arguments are forwarded to the spider's __init__,
        # so the spiders need to accept the search keyword there
        process.crawl(tmallSpider, keyword=product_list[a])
        process.crawl(jdSpider, keyword=product_list[a])
    process.start()

The point here is that the process is started only once, outside the loop, and that the CrawlerProcess is also instantiated outside the loop; otherwise every iteration would overwrite the previous CrawlerProcess instance.
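Since the question mentions that the input doesn't have to be a list, here is a minimal sketch of loading the keywords from a CSV file instead. The load_keywords helper, the keyword argument name, and the products.csv filename are illustrative, not part of the original code; what is standard Scrapy behavior is that extra crawl() keyword arguments are forwarded to the spider's constructor.

```python
import csv

def load_keywords(path):
    """Read search keywords from the first column of a CSV file."""
    with open(path, newline='', encoding='utf-8') as f:
        return [row[0] for row in csv.reader(f) if row]

# Hypothetical usage with the spiders from the question:
#
# process = CrawlerProcess(settings={...})
# for keyword in load_keywords('products.csv'):
#     process.crawl(tmallSpider, keyword=keyword)
#     process.crawl(jdSpider, keyword=keyword)
# process.start()
#
# Each spider then picks the keyword up in its constructor:
#
# class tmallSpider(scrapy.Spider):
#     name = 'tmall'
#     def __init__(self, keyword=None, *args, **kwargs):
#         super().__init__(*args, **kwargs)
#         self.keyword = keyword
```

The spider can then use self.keyword wherever the code currently indexes into the product list.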

renatodvc answered Sep 27 '22 21:09


The process should be started after all of the spiders are set up, as can be seen here:

https://docs.scrapy.org/en/latest/topics/practices.html#running-multiple-spiders-in-the-same-process

In your scenario a little more code would have helped, but I suppose the idea is to set up the crawls for both spiders for all of the products, and only then fire up start():

if __name__ == '__main__':

    process = CrawlerProcess(settings={
        "FEEDS": {
            "itemtmall.csv": {"format": "csv",
                              "fields": ['product_name_tmall', 'product_price_tmall', 'product_discount_tmall']},
            "itemjd.csv": {"format": "csv",
                           "fields": ['product_name_jd', 'product_price_jd', 'product_discount_jd']},
        },
    })
    for a in range(len(product_list)):
        process.crawl(tmallSpider)
        process.crawl(jdSpider)
    process.start()
MartiONE answered Sep 27 '22 19:09