I created a Scrapy spider, but I want to run it as a script. How can I do this? Right now I am able to run it with this command in the terminal:
$ scrapy crawl book -o book.json
But I want to run it like a simple Python script.
You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.
The CrawlerProcess class runs one or more Scrapy spiders in a single process. Create an instance of CrawlerProcess with the project settings; if a particular spider needs its own settings, create a Crawler instance for that spider (or set custom_settings on the spider class).
You can run a spider directly in a Python script without creating a project.
You have to use scrapy.crawler.CrawlerProcess
or scrapy.crawler.CrawlerRunner,
but I'm not sure whether it has all the functionality available in a project.
See more in the documentation: Common Practices
Or you can put your command in a bash script on Linux or in a .bat
file on Windows.
BTW: on Linux you can add a shebang as the first line (#!/bin/bash)
and set the "executable" attribute -
i.e. chmod +x your_script
- and it will run like a normal program.
Working example
#!/usr/bin/env python3

import scrapy


class MySpider(scrapy.Spider):

    name = 'myspider'

    # Domain only - no scheme like 'http://'
    allowed_domains = ['quotes.toscrape.com']

    #start_urls = []

    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)


# --- it runs without a project and saves results in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()