Here's a simple Scrapy spider:
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["www.dmoz.org"]
    start_urls = ("https://www.dmoz.org/",)

    def parse(self, response):
        yield scrapy.Request(self.start_urls[0], callback=self.parse2)

    def parse2(self, response):
        print(response.url)
When I run the spider, the parse2 method never runs, so response.url is never printed. I then found the solution in the thread below:
Why is my second request not getting called in the parse method of my scrapy spider
It's just that I needed to add dont_filter=True as an argument to the Request to make the parse2 function work:
yield scrapy.Request(self.start_urls[0], callback=self.parse2, dont_filter=True)
But in the examples given in the Scrapy documentation and in many YouTube tutorials, the dont_filter=True argument is never passed to scrapy.Request, and yet their second parse functions still work. Take a look at this:
def parse_page1(self, response):
    return scrapy.Request("http://www.example.com/some_page.html",
                          callback=self.parse_page2)

def parse_page2(self, response):
    # this would log http://www.example.com/some_page.html
    self.logger.info("Visited %s", response.url)
Why can't my spider work unless dont_filter=True is added? What am I doing wrong? What were the duplicate links that my spider had filtered in my first example?
P.S. I could have resolved this in the Q&A thread I posted above, but I'm not allowed to comment unless I have 50 reputation (poor me!).
Scrapy filters your requests so you don't end up crawling the same pages; dont_filter quite literally means to ignore this filter for this one request.
The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. This method, as well as any other Request callback, must return an iterable of Request and/or item objects.
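As a minimal sketch of that requirement (the selectors here are illustrative, not from the original question), a callback can mix both kinds of output:

def parse(self, response):
    # Yield a scraped item and follow-up requests from the same callback.
    yield {"title": response.css("title::text").get()}
    for href in response.css("a::attr(href)").getall():
        yield response.follow(href, callback=self.parse)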
It is an old topic, but for anyone who needs it: to pass an extra parameter to a callback, you must use cb_kwargs, then accept that parameter in the parse method. You can refer to this part of the documentation.
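For example, a small sketch (the URL and the category value are made up for illustration):

def parse(self, response):
    yield scrapy.Request(
        "https://www.example.com/some_page.html",  # illustrative URL
        callback=self.parse_page,
        cb_kwargs={"category": "books"},  # extra keyword passed to the callback
    )

def parse_page(self, response, category):
    # cb_kwargs entries arrive as named arguments alongside response.
    self.logger.info("Visited %s (category=%s)", response.url, category)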
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract data, and a callback function, which is invoked when there is a response to the request.
Short answer: you are making duplicate requests. Scrapy has built-in duplicate filtering which is turned on by default. That's why parse2 doesn't get called. When you add dont_filter=True, Scrapy doesn't filter out the duplicate request, so this time it is processed.
Longer version:
In Scrapy, if you have set start_urls or have the method start_requests() defined, the spider automatically requests those URLs and passes each response to the parse method, which is the default method used for parsing responses. You can then yield new requests from there, which will again be crawled by Scrapy. If you don't set a callback, the parse method will be used again; if you set a callback, that callback will be used.
Scrapy also has a built-in filter which stops duplicate requests. That is, if Scrapy has already crawled a URL and parsed the response, it will not process another request for that same URL, even if you yield one.
In your case, you have the URL in start_urls. Scrapy starts with that URL: it crawls the site and passes the response to parse. Inside that parse method, you yield a request to that same URL (which Scrapy just processed), but this time with parse2 as the callback. When this request is yielded, Scrapy sees it as a duplicate, so it ignores the request and never processes it. Hence no call to parse2 is ever made.
If you want to control which URLs are processed and which callbacks are used, I recommend you override start_requests() and yield scrapy.Request objects from there instead of relying on the single start_urls attribute.
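For example, a minimal sketch reusing the spider from the question:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = "dmoz"

    def start_requests(self):
        # Each request gets an explicit callback, and no URL is
        # requested twice, so the duplicate filter never kicks in.
        yield scrapy.Request("https://www.dmoz.org/", callback=self.parse2)

    def parse2(self, response):
        print(response.url)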