 

Scrapy: What's the correct way to use start_requests()?

Tags:

python

scrapy

This is how my spider is set up:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # start_requests must return an iterable of Requests
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]

It goes to /some-other-url but not /some-url. What is wrong here? The URL specified in start_urls is the one that needs links extracted and sent through the rules filter, whereas the one in start_requests should be sent directly to the item parser, so it doesn't need to pass through the rules filter.

Crypto asked Feb 11 '14 12:02

1 Answer

As the documentation for start_requests explains, overriding start_requests means that the URLs defined in start_urls are ignored:

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.
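
This is because the stock start_requests is essentially just a loop over start_urls, so replacing it drops that loop. A paraphrased sketch of the default behaviour in Scrapy of that era (not copied from any particular version):

def start_requests(self):
    # The default implementation turns each entry in start_urls
    # into a Request via make_requests_from_url().
    for url in self.start_urls:
        yield self.make_requests_from_url(url)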

If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then have start_requests yield a Request for /some-url as well, as sketched below, since start_urls is ignored while start_requests is overridden.
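
A minimal sketch of that second option, reusing the hypothetical URLs and callbacks from the question. A request yielded without an explicit callback is handled by CrawlSpider's built-in parse(), so the rules still apply to it, while the request with an explicit callback bypasses the rules:

def start_requests(self):
    # No callback: CrawlSpider's default parse() handles this response
    # and runs the link-extraction rules over it.
    yield Request('http://www.domain.com/some-url')
    # Explicit callback: this response skips the rules and goes
    # straight to do_something_else.
    yield Request('http://www.domain.com/some-other-url',
                  callback=self.do_something_else)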

Talvalin answered Sep 20 '22 12:09