Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Starting Scrapy from a Django view

My experience with Scrapy is limited, and each time I use it, it's always through the terminal's commands. How can I get my form data (a url to be scraped) from my django template to communicate with scrapy to start doing scraping? So far, I've only thought of is to get the form's returned data from django's views and then try to reach into the spider.py in scrapy's directory to add the form data's url to the spider's start_urls. From there, I don't really know how to trigger the actual crawling since I'm used to doing it strictly through my terminal with commands like "scrapy crawl dmoz". Thanks.

tiny edit: Just discovered scrapyd... I think I may be headed in the right direction with this.

like image 853
pyramidface Avatar asked Nov 14 '14 02:11

pyramidface


People also ask

Can we use Scrapy with Django?

Scrapy and Django The class also provides a method to create and populate the Django model instance with the item data from the pipeline. This library allows us to integrate Scrapy and Django easily and means we can also have access to all the data directly in the admin!

How do you run a Scrapy in a script?

Basic ScriptThe key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.

What is a spider in Scrapy?

Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).


1 Answers

You've actually answered it with an edit. The best option would be to setup scrapyd service and make an API call to schedule.json to trigger a scraping job to run.

To make that API http call, you can either use urllib2/requests, or use a wrapper around scrapyd API - python-scrapyd-api:

from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')
scrapyd.schedule('project_name', 'spider_name')

If we put aside scrapyd and try to run the spider from the view, it will block the request until the twisted reactor would stop - therefore, it is not really an option.

You can though, start using celery (in tandem with django_celery) - define a task that would run your Scrapy spider and call the task from your django view. This way, you would put the task on the queue and would not have a user waiting for crawling to be finished.


Also, take a look at the django-dynamic-scraper package:

Django Dynamic Scraper (DDS) is an app for Django build on top of the scraping framework Scrapy. While preserving many of the features of Scrapy it lets you dynamically create and manage spiders via the Django admin interface.

like image 72
alecxe Avatar answered Oct 06 '22 00:10

alecxe