 

Scrapy: How to run spider from other python script twice or more?

Scrapy version: 1.0.5

I have searched for a long time, but most of the workarounds don't work in the current Scrapy version.

My spider is defined in jingdong_spider.py, and the interface (learned from the Scrapy documentation) for running the spider is below:

# interface
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

Then, in temp.py, I call the search(keyword) defined above to run the spider.

Now the problem: calling search(keyword) once works fine. But calling it twice, for instance,

in temp.py

search('iphone')
search('ipad2')

it reported:

Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run() # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The first search(keyword) succeeds, but the second one fails.

Could you help?

guo asked Apr 05 '16 06:04


People also ask

How do you run multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in the same process simultaneously. We create an instance of CrawlerProcess with the project settings, and schedule each spider on it. If a spider needs custom settings, we create a Crawler instance for that spider.

How do you run a Scrapy spider from a Python script?

The key to running Scrapy in a Python script is the CrawlerProcess class. This is a class from the crawler module. It provides the engine to run Scrapy within a Python script. Internally, CrawlerProcess uses Python's Twisted framework.

What is CrawlerProcess?

CrawlerProcess. This class will start a Twisted reactor for you, configuring the logging and setting shutdown handlers. This class is the one used by all Scrapy commands. Here's an example showing how to run a single spider with it.


1 Answer

In your code sample you start the Twisted reactor on every call to search(). This does not work because there is only one reactor per process and it cannot be started twice.

There are two ways to solve your problem, both described in the documentation. Either stick with CrawlerRunner but move reactor.run() outside your search() function, to ensure the reactor is started only once; or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this:

from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # schedule a crawl; nothing runs until runner.start() is called
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
runner.start()  # starts the reactor once, blocks until all crawls finish
Pawel Miech answered Sep 21 '22 06:09