How to run a Scrapy spider programmatically, like a simple script?

I created a Scrapy spider, but I want to run it as a script. How can I do this? Right now I am able to run it with this command in the terminal:

$ scrapy crawl book -o book.json

But I want to run it like a simple Python script.

asked Dec 13 '17 by Ravi Siswaliya


People also ask

How do you run the Scrapy spider script?

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.

How do you run multiple spiders in a Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in one process simultaneously. We create an instance of CrawlerProcess with the project settings, and create a separate Crawler instance for a spider only if it needs custom settings.
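A minimal sketch of that pattern (QuotesSpider and AuthorsSpider are hypothetical spider classes standing in for your own, not from the question):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0'})
process.crawl(QuotesSpider)    # each crawl() call queues one spider
process.crawl(AuthorsSpider)   # QuotesSpider/AuthorsSpider: your own classes
process.start()                # runs all queued spiders, blocks until done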

Can Scrapy handle JavaScript?

Executing JavaScript in Scrapy with ScrapingBee

ScrapingBee is a web scraping API that handles headless browsers and proxies for you. ScrapingBee uses the latest headless Chrome version and can execute JavaScript. Like the other two middlewares, you can simply install the scrapy-scrapingbee middleware with pip.
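As a rough sketch of that setup (setting names follow the scrapy-scrapingbee README as I remember it; treat the exact identifiers and the middleware priority as assumptions, and the API key is a placeholder):

# in settings.py -- names per the scrapy-scrapingbee README, unverified here
SCRAPINGBEE_API_KEY = 'YOUR_API_KEY'   # placeholder, use your own key
DOWNLOADER_MIDDLEWARES = {
    'scrapy_scrapingbee.ScrapingBeeMiddleware': 725,   # assumed path/priority
}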


1 Answer

You can run a spider directly in a Python script, without creating a Scrapy project.

You have to use scrapy.crawler.CrawlerProcess or scrapy.crawler.CrawlerRunner,
though I'm not sure it offers all the functionality you get in a full project.

See more in documentation: Common Practices
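For completeness, here is a minimal CrawlerRunner sketch along the lines of that page; unlike CrawlerProcess, it leaves the Twisted reactor under your control (MySpider is the spider class from the working example below):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()                    # CrawlerRunner does not set up logging for you
runner = CrawlerRunner()
d = runner.crawl(MySpider)             # crawl() returns a Twisted Deferred
d.addBoth(lambda _: reactor.stop())    # stop the reactor once the crawl ends
reactor.run()                          # blocks here until the crawl is finished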

Or you can put your command in a bash script on Linux, or in a .bat file on Windows.

BTW: on Linux you can add a shebang as the first line (#!/bin/bash) and set the "executable" attribute -
i.e. chmod +x your_script - and then it will run like a normal program.
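For example, a minimal wrapper (the filename run_spider.sh is just an illustration):

#!/bin/bash
scrapy crawl book -o book.json

Then chmod +x run_spider.sh once, and start the crawl with ./run_spider.sh.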


Working example

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']  # must be domains, not full URLs

    start_urls = ['https://quotes.toscrape.com']

    # --- or generate the requests yourself (self.tags, self.pages and
    # self.url_template would have to be defined for this to work) ---
    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)

# --- runs without a project and saves the output in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # FEED_FORMAT/FEED_URI still work but are deprecated since Scrapy 2.1;
    # newer versions use the FEEDS setting instead
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
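Save all of this as a single file, e.g. crawl.py (the name is just an illustration), run it with python3 crawl.py, and the scraped rows end up in output.csv.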
answered Oct 22 '22 by furas