
How to integrate Flask & Scrapy?


I'm using Scrapy to get data and I want to use the Flask web framework to show the results on a web page. But I don't know how to call the spiders from the Flask app. I've tried using CrawlerProcess to call my spiders, but I got an error like this:

ValueError: signal only works in main thread

Traceback (most recent call last)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1820, in wsgi_app
    response = self.make_response(self.handle_exception(e))
File "/Library/Python/2.7/site-packages/flask/app.py", line 1403, in handle_exception
    reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
File "/Library/Python/2.7/site-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
File "/Library/Python/2.7/site-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
File "/Users/Rabbit/PycharmProjects/Flask_template/FlaskTemplate.py", line 102, in index
    process = CrawlerProcess()
File "/Library/Python/2.7/site-packages/scrapy/crawler.py", line 210, in __init__
    install_shutdown_handlers(self._signal_shutdown)
File "/Library/Python/2.7/site-packages/scrapy/utils/ossignal.py", line 21, in install_shutdown_handlers
    reactor._handleSignals()
File "/Library/Python/2.7/site-packages/twisted/internet/posixbase.py", line 295, in _handleSignals
    _SignalReactorMixin._handleSignals(self)
File "/Library/Python/2.7/site-packages/twisted/internet/base.py", line 1154, in _handleSignals
    signal.signal(signal.SIGINT, self.sigInt)
ValueError: signal only works in main thread

My Scrapy code looks like this:

class EPGD(Item):
    genID = Field()
    genID_url = Field()
    taxID = Field()
    taxID_url = Field()
    familyID = Field()
    familyID_url = Field()
    chromosome = Field()
    symbol = Field()
    description = Field()


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=" + term + "&submit=Feeling+Lucky"]

    db = DB_Con()
    collection = db.getcollection(name, term)

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url + map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url + map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            self.collection.update({"genID": item['genID']}, dict(item), upsert=True)
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i + 1 < len(url_list[0]):
                    print url_list[0][i + 1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/" + url_list[0][i + 1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

My Flask code looks like this:

@app.route('/', methods=['GET', 'POST'])
def index():
    process = CrawlerProcess()
    process.crawl(EPGD_spider)
    return redirect(url_for('details'))


@app.route('/details', methods=['GET'])
def epgd():
    if request.method == 'GET':
        results = db['EPGD_test'].find()
        json_results = []
        for result in results:
            json_results.append(result)
        return toJson(json_results)

How can I call my Scrapy spiders when using the Flask web framework?

asked Apr 03 '16 by Coding_Rabbit



2 Answers

Adding an HTTP server in front of your spiders is not that easy. There are a couple of options.

1. Python subprocess

If you are really limited to Flask and can't use anything else, the only way to integrate Scrapy with Flask is to launch an external process for every spider crawl, as the other answer recommends (note that your subprocess needs to be spawned in the proper Scrapy project directory).

The directory structure for all examples should look like this (I'm using the dirbot test project):

> tree -L 1

├── dirbot
├── README.rst
├── scrapy.cfg
├── server.py
└── setup.py

Here's a code sample that launches Scrapy in a new process:

# server.py
import subprocess

from flask import Flask

app = Flask(__name__)


@app.route('/')
def hello_world():
    """
    Run spider in another process and store items in file. Simply issue command:

    > scrapy crawl dmoz -o "output.json"

    wait for this command to finish, and read output.json to client.
    """
    spider_name = "dmoz"
    subprocess.check_output(['scrapy', 'crawl', spider_name, "-o", "output.json"])
    with open("output.json") as items_file:
        return items_file.read()


if __name__ == '__main__':
    app.run(debug=True)

Save the above as server.py and visit localhost:5000; you should be able to see the scraped items.

2. Twisted-Klein + Scrapy

Another, better way is to use an existing project that integrates Twisted with Werkzeug and exposes an API similar to Flask, e.g. Twisted-Klein. Twisted-Klein allows you to run your spiders asynchronously in the same process as your web server. It's better in that it won't block on every request, and it allows you to simply return Scrapy/Twisted Deferreds from the HTTP route request handler.

The following snippet integrates Twisted-Klein with Scrapy; note that you need to create your own subclass of CrawlerRunner so that the crawler will collect items and return them to the caller. This option is a bit more advanced: you're running Scrapy spiders in the same process as the Python server, and items are not stored in a file but kept in memory (so there is no disk writing/reading as in the previous example). Most importantly, it's asynchronous and it all runs in one Twisted reactor.

# server.py
import json

from klein import route, run
from scrapy import signals
from scrapy.crawler import CrawlerRunner

from dirbot.spiders.dmoz import DmozSpider


class MyCrawlerRunner(CrawlerRunner):
    """
    Crawler object that collects items and returns output after finishing crawl.
    """
    def crawl(self, crawler_or_spidercls, *args, **kwargs):
        # keep all items scraped
        self.items = []

        # create crawler (same as in base CrawlerProcess)
        crawler = self.create_crawler(crawler_or_spidercls)

        # handle each item scraped
        crawler.signals.connect(self.item_scraped, signals.item_scraped)

        # create Twisted.Deferred launching crawl
        dfd = self._crawl(crawler, *args, **kwargs)

        # add callback - when crawl is done call return_items
        dfd.addCallback(self.return_items)
        return dfd

    def item_scraped(self, item, response, spider):
        self.items.append(item)

    def return_items(self, result):
        return self.items


def return_spider_output(output):
    """
    :param output: items scraped by CrawlerRunner
    :return: json with list of items
    """
    # this just turns items into dictionaries
    # you may want to use Scrapy JSON serializer here
    return json.dumps([dict(item) for item in output])


@route("/")
def schedule(request):
    runner = MyCrawlerRunner()
    spider = DmozSpider()
    deferred = runner.crawl(spider)
    deferred.addCallback(return_spider_output)
    return deferred


run("localhost", 8080)

Save the above in a file called server.py and place it in your Scrapy project directory, then open localhost:8080; it will launch the dmoz spider and return the scraped items as JSON to the browser.

3. ScrapyRT

Some problems arise when you try to put an HTTP app in front of your spiders: for example, you sometimes need to handle spider logs, you need to handle spider exceptions somehow, and so on. There are projects that let you add an HTTP API to your spiders in an easier way, e.g. ScrapyRT. This is an app that adds an HTTP server to your Scrapy spiders and handles all of those problems for you (logging, spider errors, etc.).

So after installing ScrapyRT you only need to run:

> scrapyrt  

in your Scrapy project directory, and it will launch an HTTP server listening for requests. You can then visit http://localhost:9080/crawl.json?spider_name=dmoz&url=http://alfa.com and it should launch your spider, crawling the given URL.
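If you still want a Flask front end on top of ScrapyRT, the Flask route can simply forward requests to the ScrapyRT endpoint. Below is a minimal sketch (not part of the original answer, untested) using the requests library; the spider name dmoz, the port 9080, and the assumption that ScrapyRT returns the scraped items under an "items" key follow the example URL above and may need adjusting for your project:

# flask_scrapyrt_proxy.py -- hypothetical file name, a sketch only
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)

# assumes ScrapyRT was started in the Scrapy project directory, as shown above
SCRAPYRT_URL = "http://localhost:9080/crawl.json"


@app.route("/scrape")
def scrape():
    # e.g. /scrape?url=http://alfa.com
    target_url = request.args.get("url", "http://alfa.com")
    resp = requests.get(
        SCRAPYRT_URL,
        params={"spider_name": "dmoz", "url": target_url},
        timeout=120,
    )
    # ScrapyRT answers with a JSON document; the scraped items are
    # expected under the "items" key (adjust if your version differs)
    return jsonify(resp.json().get("items", []))


if __name__ == "__main__":
    app.run(debug=True)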

Disclaimer: I'm one of the authors of ScrapyRT.

answered Sep 24 '22 by Pawel Miech


There is at least one more way to do it which hasn't been presented here yet: using the crochet library. To demonstrate, we create a minimal Flask app that returns JSON output, plus a modified version of a basic example spider.

flask_app.py:

import crochet
crochet.setup()  # initialize crochet before further imports

from flask import Flask, jsonify
from scrapy import signals
from scrapy.crawler import CrawlerRunner
from scrapy.signalmanager import dispatcher

from myproject.spiders import example


app = Flask(__name__)
output_data = []
crawl_runner = CrawlerRunner()
# crawl_runner = CrawlerRunner(get_project_settings()) if you want to apply settings.py


@app.route("/scrape")
def scrape():
    # run crawler in twisted reactor synchronously
    scrape_with_crochet()

    return jsonify(output_data)


@crochet.wait_for(timeout=60.0)
def scrape_with_crochet():
    # signal fires when single item is processed
    # and calls _crawler_result to append that item
    dispatcher.connect(_crawler_result, signal=signals.item_scraped)
    eventual = crawl_runner.crawl(
        example.ToScrapeSpiderXPath)
    return eventual  # returns a twisted.internet.defer.Deferred


def _crawler_result(item, response, spider):
    """
    We're using dict() to decode the items.
    Ideally this should be done using a proper export pipeline.
    """
    output_data.append(dict(item))


if __name__ == '__main__':
    app.run('0.0.0.0', 8080)

spiders/example.py:

import scrapy


class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.xpath('//div[@class="quote"]'):
            yield MyItem(
                text=quote.xpath('./span[@class="text"]/text()').extract_first(),
                author=quote.xpath('.//small[@class="author"]/text()').extract_first())

        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))

This whole setup works synchronously, which means /scrape won't return anything until the crawling process is done. Here's some additional information from the crochet documentation:

Setup: Crochet does a number of things for you as part of setup. Most significantly, it runs Twisted’s reactor in a thread it manages.

@wait_for: Blocking calls into Twisted (...) When the decorated function is called, the code will not run in the calling thread, but rather in the reactor thread.
The function blocks until a result is available from the code running in the Twisted thread.

This solution is inspired by the following 2 posts:
Execute Scrapy spiders in a Flask web application
Get Scrapy crawler output/results in script file function

Note that this is a rather prototype-like approach, since for example output_data will keep its state across requests. If you're just looking for a way to get started, this might do fine.
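If the module-level output_data bothers you, one way around it (a rough sketch, not tested, combining this answer with the MyCrawlerRunner subclass from the Twisted-Klein answer above) is to create a fresh runner per request inside the @crochet.wait_for function, so that its crawl() Deferred resolves to that request's items and no state is shared between requests. The module name mycrawler.py holding a copy of that class is hypothetical:

# flask_app_per_request.py -- hypothetical file name, a sketch only
import crochet
crochet.setup()  # start the Twisted reactor in a background thread

from flask import Flask, jsonify

from mycrawler import MyCrawlerRunner    # assumed: the subclass from answer 2, copied into mycrawler.py
from myproject.spiders import example    # same hypothetical project layout as above

app = Flask(__name__)


@crochet.wait_for(timeout=60.0)
def run_spider():
    # a fresh runner per request; its crawl() Deferred fires with the collected
    # items, which @wait_for then hands back to the calling (Flask) thread
    runner = MyCrawlerRunner()
    return runner.crawl(example.ToScrapeSpiderXPath)


@app.route("/scrape")
def scrape():
    # blocks until the crawl finishes, then returns only this request's items
    items = run_spider()
    return jsonify([dict(item) for item in items])


if __name__ == '__main__':
    app.run('0.0.0.0', 8080)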

answered Sep 22 '22 by nichoio