
Python scrapy ReactorNotRestartable substitute

I am trying to build an app in Python with Scrapy that has the following functionality:

  • A REST API (built with Flask) listens for crawl/scrape requests and returns the response once crawling has finished. (The crawl is short, so the connection can be kept alive until it completes.)

I am able to do this using the following code:

items = []
def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass,settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)

# Stop the reactor when the spider closes; without this, the code hangs at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) #@UndefinedVariable 
crawler.crawl(requestParams=requestParams)
# start crawling 
reactor.run() #@UndefinedVariable
return str(items)

The problem appears after the reactor stops (stopping it seems necessary, since I don't want to be stuck at reactor.run()): I cannot serve any further requests. Once the first request has completed, the next one fails with the following error:

Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable

This is expected, since a Twisted reactor cannot be restarted once it has been stopped.

So my questions are:

1) How can I support subsequent crawl requests?

2) Is there any way to continue past reactor.run() without stopping the reactor?

sagar asked Sep 11 '16 08:09


1 Answer

Here is a simple solution to your problem:

from flask import Flask
import threading
import subprocess
import sys

app = Flask(__name__)


def start_crawl():
    # Run the crawler in a separate process so every request gets a fresh Twisted reactor.
    subprocess.Popen([sys.executable, "start_request.py"])


@app.route("/crawler/start")
def start_req():
    print(":request")
    # Launch the crawl from a background thread so the request returns immediately.
    threading.Thread(target=start_crawl).start()
    return "Your crawler is in running state"


if __name__ == "__main__":
    app.run(port=5000)

The solution above assumes that you can start your crawler from the shell/command line by running the start_request.py script.
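The handler above replies before the crawl has finished. If the response should carry the scraped items, as the question asks, one option is to wait for the child process and read its output. This is only a minimal sketch, assuming (hypothetically) that start_request.py prints its items as a JSON list on stdout; `run_crawl_and_collect` and the inline stand-in command are illustrative names, not part of Scrapy or Flask:

```python
import json
import subprocess
import sys


def run_crawl_and_collect(cmd):
    """Run a crawler command in a child process and parse its stdout as JSON.

    Each call starts a fresh interpreter, so the Twisted reactor inside the
    child is started only once per process.
    """
    # check_output blocks until the child exits and raises CalledProcessError
    # if the exit status is non-zero.
    raw = subprocess.check_output(cmd)
    return json.loads(raw.decode("utf-8"))


# Stand-in for [sys.executable, "start_request.py"]: a child that prints a
# JSON list, which is what the hypothetical script is assumed to do.
demo_cmd = [
    sys.executable,
    "-c",
    "import json; print(json.dumps([{'title': 'demo'}]))",
]
items = run_crawl_and_collect(demo_cmd)
print(items)
```

In the Flask view this would replace the fire-and-forget thread: call run_crawl_and_collect([sys.executable, "start_request.py"]) and return the result, at the cost of keeping the request open while the crawl runs.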

Here we use Python threading to launch a new thread for each incoming request, so a separate crawler instance can run in parallel for each hit. Just control the number of threads using threading.activeCount().
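One caveat: subprocess.Popen returns as soon as the child has been launched, so the thread count alone may not reflect how many crawler processes are still running. A sketch of capping concurrent crawls with a semaphore instead; `MAX_CRAWLS`, `try_start_crawl`, and the sleeping stand-in command are assumptions made for illustration:

```python
import subprocess
import sys
import threading

MAX_CRAWLS = 2  # hypothetical limit on simultaneous crawler processes
_slots = threading.BoundedSemaphore(MAX_CRAWLS)


def _run_and_release(cmd):
    try:
        # call() waits for the child to exit, so the slot stays taken for
        # the whole lifetime of the crawl.
        subprocess.call(cmd)
    finally:
        _slots.release()


def try_start_crawl(cmd):
    """Start cmd in the background if a slot is free; return the thread or None."""
    if not _slots.acquire(False):  # non-blocking acquire
        return None
    worker = threading.Thread(target=_run_and_release, args=(cmd,))
    worker.start()
    return worker


# Stand-in for [sys.executable, "start_request.py"].
demo_cmd = [sys.executable, "-c", "import time; time.sleep(0.5)"]
started = [try_start_crawl(demo_cmd) for _ in range(3)]
accepted = [t for t in started if t is not None]
for t in accepted:
    t.join()
print(len(accepted))
```

A request handler could call try_start_crawl and answer with a "busy" response when it returns None, rather than starting an unbounded number of crawler processes.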

Dinesh Agrawal answered Sep 20 '22 22:09