Building a RESTful Flask API for Scrapy

Tags: python, heroku, flask, scrapy, twisted

The API should accept arbitrary HTTP GET requests containing the URL the user wants scraped, and Flask should then return the results of the scrape.

The following code works for the first HTTP request, but once the Twisted reactor stops, it won't restart. I may not even be going about this the right way; I just want to put a RESTful Scrapy API up on Heroku, and what I have so far is all I can think of.

Is there a better way to architect this solution? Or how can I let scrape_it return without stopping the Twisted reactor (which can't be started again)?

from flask import Flask
import os
import sys
import json

from n_grams.spiders.n_gram_spider import NGramsSpider

# scrapy api
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals

app = Flask(__name__)


def scrape_it(url):
    items = []
    def add_item(item):
        items.append(item)

    runner = CrawlerRunner()

    d = runner.crawl(NGramsSpider, [url])
    d.addBoth(lambda _: reactor.stop()) # <<< TROUBLES HERE ???

    dispatcher.connect(add_item, signal=signals.item_passed)

    reactor.run(installSignalHandlers=0) # the script will block here until the crawling is finished


    return items

@app.route('/scrape/<path:url>')
def scrape(url):

    ret = scrape_it(url)

    return json.dumps(ret, ensure_ascii=False, encoding='utf8')


if __name__ == '__main__':
    PORT = os.environ['PORT'] if 'PORT' in os.environ else 8080

    app.run(debug=True, host='0.0.0.0', port=int(PORT))


asked Sep 22 '15 18:09

Josh.F

2 Answers

I don't think there is a good way to create a Flask-based API for Scrapy. Flask is not the right tool for this because it is not based on an event loop. To make things worse, the Twisted reactor (which Scrapy uses) can't be started and stopped more than once in a single thread.

Even supposing the Twisted reactor could be started and stopped freely, things would not be much better: your scrape_it function may block for an extended period of time, so you would need many threads or processes.

I think the way to go is to build the API with an async framework like Twisted or Tornado; it will be more efficient than a Flask-based (or Django-based) solution because the API will be able to serve requests while Scrapy is running a spider.

Scrapy is based on Twisted, so using twisted.web or https://github.com/twisted/klein can be more straightforward. But Tornado is not hard either, because you can make it use the Twisted event loop.

There is a project called ScrapyRT which does something very similar to what you want to implement: it is an HTTP API for Scrapy. ScrapyRT is based on Twisted.
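To illustrate what calling such an HTTP API could look like from the client side, here is a minimal sketch using only the standard library. It assumes ScrapyRT has been started inside the Scrapy project and listens on its default port 9080 with the crawl.json endpoint; the spider name "n_grams" is hypothetical, since the question does not show the spider's name attribute.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumptions: ScrapyRT runs locally on its default port, and "n_grams" is a
# hypothetical spider name (the question does not show the real one).
SCRAPYRT_ENDPOINT = "http://localhost:9080/crawl.json"


def build_crawl_url(target_url, spider_name="n_grams", endpoint=SCRAPYRT_ENDPOINT):
    """Return the GET URL that asks ScrapyRT to crawl target_url with spider_name."""
    query = urlencode({"spider_name": spider_name, "url": target_url})
    return f"{endpoint}?{query}"


def scrape(target_url, spider_name="n_grams", timeout=60):
    """Run one crawl through ScrapyRT and return the list of scraped items."""
    with urlopen(build_crawl_url(target_url, spider_name), timeout=timeout) as resp:
        payload = json.load(resp)
    # The response is assumed to carry the scraped items under the "items" key.
    return payload.get("items", [])
```

Because ScrapyRT keeps a single reactor running for the lifetime of the server, each request schedules a new crawl without the start/stop problem described in the question.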

As an example of Scrapy-Tornado integration, check Arachnado: it shows how to integrate Scrapy's CrawlerProcess with Tornado's Application.

If you really want a Flask-based API, it could make sense to start crawls in separate processes and/or use a queue solution like Celery. This way you lose most of Scrapy's efficiency; if you go this way, you could just as well use requests + BeautifulSoup.
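The separate-process idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested deployment: it assumes Scrapy is installed, that the child process runs inside the Scrapy project so NGramsSpider is importable, and that the spider takes a list of URLs as its first argument, as in the question. The helper name run_json_script is made up for this example.

```python
import json
import subprocess
import sys

# Script executed in the child process. Assumptions: Scrapy is installed and
# NGramsSpider is importable from the child's working directory.
CRAWL_SCRIPT = """
import json, sys
from scrapy import signals
from scrapy.crawler import CrawlerProcess
from n_grams.spiders.n_gram_spider import NGramsSpider

items = []

def collect(item):
    # Named function: signal receivers are held by weak reference by default.
    items.append(dict(item))

process = CrawlerProcess()
crawler = process.create_crawler(NGramsSpider)
crawler.signals.connect(collect, signal=signals.item_scraped)
process.crawl(crawler, [sys.argv[1]])
process.start()  # blocks until the crawl ends; the reactor dies with this process
print(json.dumps(items))
"""


def run_json_script(script, *args, timeout=300):
    """Run script in a fresh Python process and parse its last stdout line as JSON."""
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True, text=True, timeout=timeout, check=True,
    )
    return json.loads(result.stdout.strip().splitlines()[-1])


def scrape_it(url):
    """Drop-in replacement for the question's scrape_it: one process per crawl."""
    return run_json_script(CRAWL_SCRIPT, url)
```

Each call starts a new interpreter, so the reactor is started exactly once per process and never needs to be restarted. The cost is process startup on every request, which is the efficiency loss mentioned above.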


answered Nov 15 '22 19:11

Mikhail Korobov

I worked on a similar project last week, an SEO service API. My workflow was like this:

The client sends a request to the Flask-based server with a URL to scrape and a callback URL to notify the client when scraping is done (the client here is another web app).
Scrapy runs in the background using Celery. The spider saves the data to the database.
The background service notifies the client by calling the callback URL when the spider is done.
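The shape of this workflow can be sketched with the standard library alone. Note the substitution: a ThreadPoolExecutor stands in for Celery here purely to show the accept, run in background, notify flow; it is not the answerer's code. All function names are hypothetical, `scrape` is any blocking callable returning JSON-serializable items, and the database step is omitted.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen

# Stand-in for a Celery worker pool (assumption: a real setup would use Celery tasks).
executor = ThreadPoolExecutor(max_workers=2)


def notify(callback_url, payload):
    """POST the job result to the client's callback URL as JSON."""
    req = Request(
        callback_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req, timeout=30) as resp:
        return resp.status


def scrape_and_notify(url, callback_url, scrape, notify=notify):
    """Background job: run the scrape, then report success or failure to the client."""
    try:
        payload = {"url": url, "status": "done", "items": scrape(url)}
    except Exception as exc:  # report any scrape failure instead of losing it
        payload = {"url": url, "status": "failed", "error": str(exc)}
    notify(callback_url, payload)
    return payload


def submit_job(url, callback_url, scrape, notify=notify):
    """What the Flask view would call: enqueue the job and return immediately."""
    return executor.submit(scrape_and_notify, url, callback_url, scrape, notify)
```

The Flask view would call submit_job and respond right away (for example with 202 Accepted), so no HTTP request blocks for the duration of a crawl.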

answered Nov 15 '22 20:11

ahmed
