 

Scrapy: How to run spider from other python script twice or more?

Scrapy version: 1.0.5

I have searched for a long time, but most of the workarounds don't work in the current Scrapy version.

My spider is defined in jingdong_spider.py, and the interface (learned from the Scrapy documentation) for running the spider is below:

# interface
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

Then, in temp.py, I call the search(keyword) defined above to run the spider.

Now the problem: calling search(keyword) once works fine. But calling it twice, for instance,

in temp.py

search('iphone')
search('ipad2')

it reported:

Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run() # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The first search(keyword) succeeds, but the second one fails.

Could you help?

guo asked Apr 05 '16 06:04


People also ask

How do you run multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in the same process simultaneously. We create an instance of CrawlerProcess with the project settings, and schedule each spider on it. If a spider needs custom settings, we create a Crawler instance for that spider.

How do you run a Scrapy spider from a Python script?

The key to running Scrapy in a Python script is the CrawlerProcess class. This is a class from the crawler module. It provides the engine to run Scrapy within a Python script. Internally, CrawlerProcess uses Python's Twisted framework.

What is CrawlerProcess?

CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands. Here's an example showing how to run a single spider with it.


1 Answer

In your code sample you start the Twisted reactor on every call to search(). This does not work because there is only one reactor per process and it cannot be started twice.

There are two ways to solve your problem, both described in the documentation. Either stick with CrawlerRunner but move reactor.run() outside your search() function, to ensure the reactor is started only once; or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this:

from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # schedule a crawl; nothing runs until runner.start() is called
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
runner.start()  # starts the reactor once, blocks until all crawls finish
Pawel Miech answered Sep 21 '22 06:09