 

Scrapy: What's the correct way to use start_requests()?

Tags:

python

scrapy

This is how my spider is set up:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request

class CustomSpider(CrawlSpider):
    name = 'custombot'
    allowed_domains = ['www.domain.com']
    start_urls = ['http://www.domain.com/some-url']
    rules = (
        Rule(SgmlLinkExtractor(allow=r'.*?something/'), callback='do_stuff', follow=True),
    )

    def start_requests(self):
        # start_requests must return an iterable of Requests
        return [Request('http://www.domain.com/some-other-url', callback=self.do_something_else)]

It goes to /some-other-url but not /some-url. What is wrong here? The URL specified in start_urls is the one that needs links extracted and sent through the rules filter, whereas the one in start_requests should be sent directly to the item parser, so it doesn't need to pass through the rules filter.

Crypto asked Feb 11 '14 12:02

1 Answer

As the documentation for start_requests explains, overriding start_requests means that the URLs defined in start_urls are ignored:

This is the method called by Scrapy when the spider is opened for scraping when no particular URLs are specified. If particular URLs are specified, the make_requests_from_url() is used instead to create the Requests.
[...]
If you want to change the Requests used to start scraping a domain, this is the method to override.
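
This is because the stock start_requests is essentially just a loop over start_urls, so replacing it drops that loop. A paraphrased sketch of the default behaviour in Scrapy of that era (not copied from any particular version):

def start_requests(self):
    # The default implementation turns each entry in start_urls
    # into a Request via make_requests_from_url().
    for url in self.start_urls:
        yield self.make_requests_from_url(url)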

If you want to just scrape from /some-url, then remove start_requests. If you want to scrape from both, then have start_requests yield a Request for /some-url as well, as sketched below, since start_urls is ignored while start_requests is overridden.
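
A minimal sketch of that second option, reusing the hypothetical URLs and callbacks from the question. A request yielded without an explicit callback is handled by CrawlSpider's built-in parse(), so the rules still apply to it, while the request with an explicit callback bypasses the rules:

def start_requests(self):
    # No callback: CrawlSpider's default parse() handles this response
    # and runs the link-extraction rules over it.
    yield Request('http://www.domain.com/some-url')
    # Explicit callback: this response skips the rules and goes
    # straight to do_something_else.
    yield Request('http://www.domain.com/some-other-url',
                  callback=self.do_something_else)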

Talvalin answered Sep 20 '22 12:09