How does adding the dont_filter=True argument to scrapy.Request make my parsing method work?

Here's a simple Scrapy spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ["www.dmoz.org"]
    start_urls = ["https://www.dmoz.org/"]

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)

When you run this spider, the parse2 method never runs and response.url is never printed. I then found the solution in the thread below.

Why is my second request not getting called in the parse method of my scrapy spider

It turned out I just needed to add dont_filter=True as an argument to the Request to make parse2 work:

yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

But the examples in the Scrapy documentation and many YouTube tutorials never use the dont_filter=True argument in scrapy.Request, and their second parse functions still work.

Take a look at this:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

Why can't my spider work unless dont_filter=True is added? What am I doing wrong? What were the duplicate links that my spider filtered in my first example?

P.S. I could have resolved this in the Q&A thread I posted above, but I'm not allowed to comment until I have 50 reputation (poor me!).

Asked Aug 15 '16 by Uchiha Madara

People also ask

What is dont_filter in Scrapy?

Scrapy filters your requests so you don't end up crawling the same pages; dont_filter quite literally means to ignore this filter for this one request.

What does parse function do in Scrapy?

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the parse method: each must return an iterable of Request objects and/or items.
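
For instance, here is a minimal sketch of a callback that yields both an item and follow-up requests (the selectors are made up purely for illustration):

def parse(self, response):
    # yield an item -- a plain dict works as an item in Scrapy
    yield {"title": response.css("title::text").get()}
    # and yield more requests to follow
    for href in response.css("a::attr(href)").getall():
        # response.follow resolves relative URLs against the page URL
        yield response.follow(href, callback=self.parse)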

How do I pass parameters in Scrapy request?

It is an old topic, but for anyone who needs it: to pass an extra parameter you must use cb_kwargs, then accept that parameter in the callback. You can refer to this part of the documentation.
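
A minimal sketch, assuming Scrapy 1.7 or later (the URL and the category value are invented for illustration):

def parse(self, response):
    # cb_kwargs carries extra keyword arguments into the callback
    yield scrapy.Request("http://www.example.com/books.html",
                         callback=self.parse_category,
                         cb_kwargs={"category": "books"})

def parse_category(self, response, category):
    # "category" arrives here as a normal parameter
    self.logger.info("Visited %s (%s)", response.url, category)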

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.


1 Answer

Short answer: you are making a duplicate request. Scrapy has built-in duplicate filtering which is turned on by default; that's why parse2 doesn't get called. When you add dont_filter=True, Scrapy doesn't filter out the duplicate request, so this time it is processed.

Longer version:

In Scrapy, if you have set start_urls or have defined start_requests(), the spider automatically requests those URLs and passes each response to the parse method, which is the default callback. You can yield new requests from there, which Scrapy will again fetch and parse. If you don't set a callback on a request, the parse method is used again; if you set one, that callback is used.

Scrapy also has a built-in filter that stops duplicate requests. That is, if Scrapy has already crawled a URL and parsed the response, then even if you yield another request for that URL, Scrapy will not process it.
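
As an aside, the filter itself is configurable: dont_filter=True bypasses it for a single request, but it can also be replaced project-wide through the DUPEFILTER_CLASS setting. A sketch of what disabling it entirely would look like (rarely a good idea in practice; shown only to make the mechanism concrete):

# settings.py
# The default is scrapy.dupefilters.RFPDupeFilter; BaseDupeFilter
# performs no filtering at all, so every request gets processed.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"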

In your case, the URL is in start_urls. Scrapy starts with that URL: it crawls the page and passes the response to parse. Inside that parse method, you yield a request to the same URL (which Scrapy has just processed), this time with parse2 as the callback. When this request is yielded, Scrapy sees it as a duplicate, so it ignores it and never processes it. No call to parse2 is ever made.

If you want to control which URLs are processed and which callbacks are used, I recommend overriding start_requests() and returning a list of scrapy.Request objects instead of relying on the single start_urls attribute.
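
A minimal sketch of that approach, reusing the spider from the question:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.dmoz.org"]

    def start_requests(self):
        # each URL is requested exactly once, with an explicit callback,
        # so the duplicate filter never has anything to drop
        yield scrapy.Request("https://www.dmoz.org/", callback=self.parse2)

    def parse2(self, response):
        print(response.url)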

Answered Oct 11 '22 by masnun