Scrapy version: 1.0.5
I have searched for a long time, but most of the workarounds I found don't work in the current Scrapy version.
My spider is defined in jingdong_spider.py, and the interface to run the spider (which I learned from the Scrapy documentation) is below:
# interface
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished
Then in temp.py I call the search(keyword) function above to run the spider.
Now the problem: when I call search(keyword) once, it works well. But when I call it twice, for instance in temp.py:
search('iphone')
search('ipad2')
it reported:
Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run() # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable
The first search(keyword) call succeeded, but the second one failed.
Could you help?
The CrawlerProcess class can run multiple Scrapy spiders simultaneously in one process. Create an instance of CrawlerProcess with the project settings. If a spider needs custom settings, create a Crawler instance for that spider.
The key to running Scrapy from a Python script is the CrawlerProcess class, defined in the scrapy.crawler module. It provides the engine that runs Scrapy inside a script, and it relies on the Twisted framework internally.
CrawlerProcess: this class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands. Here's an example showing how to run a single spider with it:
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.
In your code sample you start the Twisted reactor (twisted.internet.reactor) on every function call. This does not work because there is only one reactor per process and it cannot be started twice.
There are two ways to solve your problem, both described in the Scrapy documentation. Either stick with CrawlerRunner but move reactor.run() outside your search() function to ensure it is only called once, or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this:
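from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # Only schedule the crawl here; the reactor is started later, once.
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
runner.start()  # starts the reactor and blocks until both crawls finish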