How does adding the dont_filter=True argument to scrapy.Request make my parsing method work?

Here's a simple Scrapy spider:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    # allowed_domains expects bare domain names, not full URLs
    allowed_domains = ["www.dmoz.org"]
    start_urls = ["https://www.dmoz.org/"]

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)

When you run this spider, the parse2 method never runs and response.url is never printed. I then found the solution in the thread below.

Why is my second request not getting called in the parse method of my scrapy spider

It turned out I just needed to add dont_filter=True as an argument to the Request to make parse2 work:

yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)

But the examples in the Scrapy documentation and many YouTube tutorials never use the dont_filter=True argument in scrapy.Request, and their second parse functions still work.

Take a look at this:

def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)

Why can't my spider work unless dont_filter=True is added? What am I doing wrong? What were the duplicate links that my spider filtered in my first example?

P.S. I could have resolved this in the Q&A thread I posted above, but I'm not allowed to comment until I have 50 reputation (poor me!).

Asked Aug 15 '16 by Uchiha Madara

People also ask

What is dont_filter in Scrapy?

Scrapy filters your requests so you don't end up crawling the same pages; dont_filter quite literally means to ignore this filter for this one request.

What does parse function do in Scrapy?

The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. Other Request callbacks have the same requirements as the parse method: each must return an iterable of Request objects and/or items.
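
For instance, here is a minimal sketch of a callback that yields both an item and follow-up requests (the selectors are made up purely for illustration):

def parse(self, response):
    # yield an item -- a plain dict works as an item in Scrapy
    yield {"title": response.css("title::text").get()}
    # and yield more requests to follow
    for href in response.css("a::attr(href)").getall():
        # response.follow resolves relative URLs against the page URL
        yield response.follow(href, callback=self.parse)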

How do I pass parameters in Scrapy request?

It is an old topic, but for anyone who needs it: to pass an extra parameter you must use cb_kwargs, then accept that parameter in the callback. You can refer to this part of the documentation.
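
A minimal sketch, assuming Scrapy 1.7 or later (the URL and the category value are invented for illustration):

def parse(self, response):
    # cb_kwargs carries extra keyword arguments into the callback
    yield scrapy.Request("http://www.example.com/books.html",
                         callback=self.parse_category,
                         cb_kwargs={"category": "books"})

def parse_category(self, response, category):
    # "category" arrives here as a normal parameter
    self.logger.info("Visited %s (%s)", response.url, category)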

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.


1 Answer

Short answer: you are making a duplicate request. Scrapy has built-in duplicate filtering which is turned on by default; that's why parse2 doesn't get called. When you add dont_filter=True, Scrapy doesn't filter out the duplicate request, so this time it is processed.

Longer version:

In Scrapy, if you have set start_urls or have defined start_requests(), the spider automatically requests those URLs and passes each response to the parse method, which is the default callback. You can yield new requests from there, which Scrapy will again fetch and parse. If you don't set a callback on a request, the parse method is used again; if you set one, that callback is used.

Scrapy also has a built-in filter that stops duplicate requests. That is, if Scrapy has already crawled a URL and parsed the response, then even if you yield another request for that URL, Scrapy will not process it.
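
As an aside, the filter itself is configurable: dont_filter=True bypasses it for a single request, but it can also be replaced project-wide through the DUPEFILTER_CLASS setting. A sketch of what disabling it entirely would look like (rarely a good idea in practice; shown only to make the mechanism concrete):

# settings.py
# The default is scrapy.dupefilters.RFPDupeFilter; BaseDupeFilter
# performs no filtering at all, so every request gets processed.
DUPEFILTER_CLASS = "scrapy.dupefilters.BaseDupeFilter"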

In your case, the URL is in start_urls. Scrapy starts with that URL: it crawls the page and passes the response to parse. Inside that parse method, you yield a request to the same URL (which Scrapy has just processed), this time with parse2 as the callback. When this request is yielded, Scrapy sees it as a duplicate, so it ignores it and never processes it. No call to parse2 is ever made.

If you want to control which URLs are processed and which callbacks are used, I recommend overriding start_requests() and returning a list of scrapy.Request objects instead of relying on the single start_urls attribute.
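
A minimal sketch of that approach, reusing the spider from the question:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.dmoz.org"]

    def start_requests(self):
        # each URL is requested exactly once, with an explicit callback,
        # so the duplicate filter never has anything to drop
        yield scrapy.Request("https://www.dmoz.org/", callback=self.parse2)

    def parse2(self, response):
        print(response.url)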

Answered Oct 11 '22 by masnun