How to run a Scrapy spider programmatically, like a simple script?

I created a Scrapy spider, but I want to run it as a script. How can I do this? Right now I am able to run it with this command in the terminal:

$ scrapy crawl book -o book.json

But I want to run it like a simple Python script.

asked Dec 13 '17 by Ravi Siswaliya


People also ask

How do you run the Scrapy spider script?

You can use the API to run Scrapy from a script, instead of the typical way of running Scrapy via scrapy crawl. Remember that Scrapy is built on top of the Twisted asynchronous networking library, so you need to run it inside the Twisted reactor. The first utility you can use to run your spiders is scrapy.crawler.CrawlerProcess.

How do you run multiple spiders in a Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in one process simultaneously. We create an instance of CrawlerProcess with the project settings, and create a separate Crawler instance for a spider only if it needs custom settings.
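A minimal sketch of that pattern (QuotesSpider and AuthorsSpider are hypothetical spider classes standing in for your own, not from the question):

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={'USER_AGENT': 'Mozilla/5.0'})
process.crawl(QuotesSpider)    # each crawl() call queues one spider
process.crawl(AuthorsSpider)   # QuotesSpider/AuthorsSpider: your own classes
process.start()                # runs all queued spiders, blocks until done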

Can Scrapy handle JavaScript?

Executing JavaScript in Scrapy with ScrapingBee

ScrapingBee is a web scraping API that handles headless browsers and proxies for you. ScrapingBee uses the latest headless Chrome version and can execute JavaScript. Like the other two middlewares, you can simply install the scrapy-scrapingbee middleware with pip.
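As a rough sketch of that setup (setting names follow the scrapy-scrapingbee README as I remember it; treat the exact identifiers and the middleware priority as assumptions, and the API key is a placeholder):

# in settings.py -- names per the scrapy-scrapingbee README, unverified here
SCRAPINGBEE_API_KEY = 'YOUR_API_KEY'   # placeholder, use your own key
DOWNLOADER_MIDDLEWARES = {
    'scrapy_scrapingbee.ScrapingBeeMiddleware': 725,   # assumed path/priority
}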


1 Answer

You can run a spider directly in a Python script, without creating a Scrapy project.

You have to use scrapy.crawler.CrawlerProcess or scrapy.crawler.CrawlerRunner,
though I'm not sure it offers all the functionality you get in a full project.

See more in documentation: Common Practices
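For completeness, here is a minimal CrawlerRunner sketch along the lines of that page; unlike CrawlerProcess, it leaves the Twisted reactor under your control (MySpider is the spider class from the working example below):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()                    # CrawlerRunner does not set up logging for you
runner = CrawlerRunner()
d = runner.crawl(MySpider)             # crawl() returns a Twisted Deferred
d.addBoth(lambda _: reactor.stop())    # stop the reactor once the crawl ends
reactor.run()                          # blocks here until the crawl is finished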

Or you can put your command in a bash script on Linux, or in a .bat file on Windows.

BTW: on Linux you can add a shebang as the first line (#!/bin/bash) and set the "executable" attribute -
i.e. chmod +x your_script - and then it will run like a normal program.
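For example, a minimal wrapper (the filename run_spider.sh is just an illustration):

#!/bin/bash
scrapy crawl book -o book.json

Then chmod +x run_spider.sh once, and start the crawl with ./run_spider.sh.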


Working example

#!/usr/bin/env python3

import scrapy

class MySpider(scrapy.Spider):

    name = 'myspider'

    allowed_domains = ['quotes.toscrape.com']  # must be domains, not full URLs

    start_urls = ['https://quotes.toscrape.com']

    # --- or generate the requests yourself (self.tags, self.pages and
    # self.url_template would have to be defined for this to work) ---
    #def start_requests(self):
    #    for tag in self.tags:
    #        for page in range(self.pages):
    #            url = self.url_template.format(tag, page)
    #            yield scrapy.Request(url)

    def parse(self, response):
        print('url:', response.url)

# --- runs without a project and saves the output in `output.csv` ---

from scrapy.crawler import CrawlerProcess

c = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    # FEED_FORMAT/FEED_URI still work but are deprecated since Scrapy 2.1;
    # newer versions use the FEEDS setting instead
    'FEED_FORMAT': 'csv',
    'FEED_URI': 'output.csv',
})
c.crawl(MySpider)
c.start()
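Save all of this as a single file, e.g. crawl.py (the name is just an illustration), run it with python3 crawl.py, and the scraped rows end up in output.csv.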
answered Oct 22 '22 by furas