I have been trying to make an app in Python using Scrapy
that has the following functionality:
I am able to do this using the following code:
# imports assumed at module level
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from scrapy import signals
from twisted.internet import reactor

items = []

def add_item(item):
    items.append(item)

# set up crawler
crawler = Crawler(SpiderClass, settings=get_project_settings())
crawler.signals.connect(add_item, signal=signals.item_passed)
# This is added to make the reactor stop; if I don't use this, the code gets stuck at the reactor.run() line.
crawler.signals.connect(reactor.stop, signal=signals.spider_closed) #@UndefinedVariable
crawler.crawl(requestParams=requestParams)
# start crawling
reactor.run() #@UndefinedVariable
return str(items)
Now the problem I am facing is that after making the reactor stop (which seems necessary to me, since I don't want to stay stuck at reactor.run()), I cannot accept any further requests. After the first request completes, I get the following error:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\flask\app.py", line 1988, in wsgi_app
    response = self.full_dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1641, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "c:\python27\lib\site-packages\flask\app.py", line 1544, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "c:\python27\lib\site-packages\flask\app.py", line 1639, in full_dispatch_request
    rv = self.dispatch_request()
  File "c:\python27\lib\site-packages\flask\app.py", line 1625, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "F:\my_workspace\jobvite\jobvite\com\jobvite\web\RequestListener.py", line 38, in submitForm
    reactor.run() #@UndefinedVariable
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "c:\python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
ReactorNotRestartable
This is obvious, since the reactor cannot be restarted.
So my questions are:
1) How can I support further crawl requests after the first one?
2) Is there any way to move past the reactor.run() line without stopping the reactor?
Here is a simple solution to your problem:
from flask import Flask
import threading
import subprocess
import sys

app = Flask(__name__)

class myThread(threading.Thread):
    def __init__(self, target):
        threading.Thread.__init__(self)
        self.target = target

    def run(self):
        # run whatever callable was handed in as target
        self.target()

def start_crawl():
    # launch the crawl in a brand-new process; each process gets its own
    # Twisted reactor, so ReactorNotRestartable can never occur
    subprocess.Popen([sys.executable, "start_request.py"])

@app.route("/crawler/start")
def start_req():
    print "request received"
    threadObj = myThread(target=start_crawl)
    threadObj.start()
    return "Your crawler is in running state"

if __name__ == "__main__":
    app.run(port=5000)
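With the app running, hitting http://localhost:5000/crawler/start (for example with curl) kicks off a new crawl.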
In the above solution I assume that you can already start your crawler from the command line by running the start_request.py file, e.g. with python start_request.py.
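For reference, start_request.py could be as small as the sketch below. This is an assumption, not your actual script: the import path and SpiderClass are placeholders for your own spider, and CrawlerProcess starts and stops its own reactor inside the subprocess, which is exactly why the restart problem goes away.

# start_request.py -- a minimal sketch of the standalone crawl script
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from myproject.spiders.my_spider import SpiderClass  # placeholder import path

process = CrawlerProcess(get_project_settings())
process.crawl(SpiderClass)  # pass spider arguments here if needed
process.start()  # blocks until the crawl finishes, then the process exits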
Now what we are doing is using Python's threading module to launch a new thread for each incoming request, and each thread spawns a separate crawler process. This way you can easily run a crawler instance in parallel for each hit. Just control the number of threads using threading.activeCount().
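If you want to cap concurrency, a variant of start_req() could reject new requests once too many threads are alive. This is only a sketch: MAX_CRAWLERS is an assumed constant, and threading.activeCount() also counts the main thread, so the limit is approximate.

MAX_CRAWLERS = 5  # assumed limit, tune as needed

@app.route("/crawler/start")
def start_req():
    # activeCount() includes the main thread, so the cap is approximate
    if threading.activeCount() > MAX_CRAWLERS:
        return "Too many crawlers running, try again later"
    myThread(target=start_crawl).start()
    return "Your crawler is in running state"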