I'm working with Scrapy. I want to loop through a database table, grab the starting page for each scrape (random_form_page), and yield a request for each start page. Note that the initial request hits an API to get a proxy, because I want each request to have its own proxy. Using the callback model, I have:
    def start_requests(self):
        for x in xrange(8):
            random_form_page = session.query(....
            PR = Request(
                'htp://my-api',
                headers=self.headers,
                meta={'newrequest': Request(random_form_page, headers=self.headers)},
                callback=self.parse_PR
            )
            yield PR
I notice:
[scrapy] DEBUG: Filtered duplicate request: <GET 'htp://my-api'> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
In my code I can see that, although the loop runs 8 times, only the request for the first page goes through. I assume the others are being filtered out. I've looked at http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class but I'm still unsure how to turn this filtering off. How can I disable it?
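Pass dont_filter=True when you create the Request. All eight requests target the same API URL, so Scrapy's duplicate filter treats every request after the first as a duplicate and drops it. With dont_filter=True the request bypasses that filter: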
    def start_requests(self):
        for x in xrange(8):
            random_form_page = session.query(....
            PR = Request(
                'htp://my-api',
                headers=self.headers,
                meta={'newrequest': Request(random_form_page, headers=self.headers)},
                callback=self.parse_PR,
                # Same URL on every iteration, so skip the duplicate filter
                dont_filter=True
            )
            yield PR
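To see why the flag matters, here is a rough, simplified model of a duplicate filter. It is not Scrapy's actual implementation: `ToyScheduler`, `fingerprint`, and `schedule_api_calls` are hypothetical names made up for this sketch, and the toy fingerprint covers only the method and URL. The point it illustrates is that requests are compared by a fingerprint that, by default, does not include `meta`, so a different `newrequest` in each request does not make the requests distinct.

```python
import hashlib


def fingerprint(method, url):
    """Hash the parts of a request that decide whether it is a duplicate."""
    return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()


class ToyScheduler:
    """Simplified stand-in for a scheduler with a duplicate filter."""

    def __init__(self):
        self.seen = set()   # fingerprints that were already scheduled
        self.queue = []     # requests that will actually be sent

    def enqueue(self, method, url, dont_filter=False):
        fp = fingerprint(method, url)
        if not dont_filter and fp in self.seen:
            return False    # dropped as a duplicate
        self.seen.add(fp)
        self.queue.append((method, url))
        return True


def schedule_api_calls(count, dont_filter):
    """Enqueue `count` identical API requests and return how many survive."""
    scheduler = ToyScheduler()
    for _ in range(count):
        scheduler.enqueue("GET", "htp://my-api", dont_filter=dont_filter)
    return len(scheduler.queue)


filtered = schedule_api_calls(8, dont_filter=False)
unfiltered = schedule_api_calls(8, dont_filter=True)
print(filtered, unfiltered)  # prints: 1 8
```

If you would rather switch the filter off for the whole project instead of per request, the DUPEFILTER_CLASS setting linked in the question can be pointed at a filter that does nothing. In recent Scrapy versions that should be 'scrapy.dupefilters.BaseDupeFilter', but check the module path against the version you have installed.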