Scrapy

Question

I'm working with Scrapy. I want to loop through a database table, grab the starting page for each scrape (random_form_page), and yield a request for each start page. Note that the initial request hits an API to get a proxy, because I want each request to have its own proxy. Using the callback model, I have:

def start_requests(self):
    for x in xrange(8): 
        random_form_page = session.query(....

        PR = Request(
            'htp://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page,  headers=self.headers)},
            callback=self.parse_PR
        )
        yield PR

I notice:

[scrapy] DEBUG: Filtered duplicate request: <GET 'htp://my-api'> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)

Although my code loops 8 times, it only yields a request for the first page; I assume the others are being filtered out. I've looked at http://doc.scrapy.org/en/latest/topics/settings.html#dupefilter-class but I'm still unsure how to turn off this filtering. How can I turn it off?

MrPandav · Accepted Answer

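Pass dont_filter=True to the Request object. All eight requests target the same API URL, so Scrapy's duplicate filter drops every one after the first unless you opt out per request: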

def start_requests(self):
    for x in xrange(8):
        random_form_page = session.query(....

        PR = Request(
            'htp://my-api',
            headers=self.headers,
            meta={'newrequest': Request(random_form_page,  headers=self.headers)},
            callback=self.parse_PR,
            # Skip the duplicate filter so every loop iteration is scheduled.
            dont_filter=True
        )
        yield PR
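To see why only the first of the eight requests survives, here is a minimal sketch of how a duplicate filter behaves. It is a simplified stand-in, not Scrapy's actual code: ToyRequest, ToyScheduler and count_scheduled are hypothetical names, and the fingerprint is assumed to be just the method and URL (Scrapy's real fingerprint covers more of the request).

```python
# Toy model of a scheduler with a duplicate filter.
# Hypothetical classes for illustration only; this is not Scrapy's implementation.
class ToyRequest(object):
    def __init__(self, url, method="GET", dont_filter=False):
        self.url = url
        self.method = method
        self.dont_filter = dont_filter


class ToyScheduler(object):
    def __init__(self):
        self.seen = set()   # fingerprints of requests already accepted
        self.queue = []     # requests that would actually be downloaded

    def enqueue(self, request):
        # Assumed fingerprint: method plus URL. The meta dict is not part of it,
        # so a different 'newrequest' in meta does not make a request unique.
        fingerprint = (request.method, request.url)
        if not request.dont_filter:
            if fingerprint in self.seen:
                return False  # dropped as a duplicate
            self.seen.add(fingerprint)
        self.queue.append(request)
        return True


def count_scheduled(dont_filter):
    """Enqueue the same API URL 8 times and count how many requests survive."""
    scheduler = ToyScheduler()
    for _ in range(8):
        scheduler.enqueue(ToyRequest("htp://my-api", dont_filter=dont_filter))
    return len(scheduler.queue)


filtered_count = count_scheduled(dont_filter=False)
unfiltered_count = count_scheduled(dont_filter=True)
```

Under these assumptions, filtered_count covers the default behaviour from the question and unfiltered_count covers the version with dont_filter=True.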

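If you would rather switch the filter off for the whole project instead of per request, the DUPEFILTER_CLASS setting linked in the question can point at a filter that never rejects anything. A minimal sketch, assuming a reasonably recent Scrapy release where the module is named scrapy.dupefilters (older releases may use a different module path):

```python
# settings.py
# BaseDupeFilter does not filter any requests, so duplicates are all scheduled.
# Assumption: the module path matches your installed Scrapy version.
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
```

This applies to every request the spider makes, so it can lead to crawl loops on sites that link back to themselves; dont_filter=True on just the proxy API request is the narrower change.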
Scrapy - Filtered duplicate request

Tags: python

user1592380
