
Scrapy crawler not completing all loops in parse function

I have this code in my crawler:

class StackSpider(InitSpider):
    name = 'stack'
    allowed_domains = ['sitepoint.com']
    start_urls = ["http://www.sitepoint.com"]
    start_page = "http://www.sitepoint.com"
    item = StackItem()

    def init_request(self):

        return Request(url=self.start_page, callback=self.parse)

    def parse(self, response):

        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="headline_area"]')
        items = []


        ivar = 1
        for site in sites[:5]:
            item = StackItem()
            log.msg(' LOOP' +str(ivar)+ '', level=log.ERROR)
            item['title'] ="yoo ma"
            request =  Request("http://www.sitepoint.com/getting-to-know-css3-selectors-structural-pseudo-classes/",  callback=self.test1)
            request.meta['item'] = item
            ivar = ivar + 1
            yield request


    def test1(self, response):
        log.msg('  LOOP 2 \n', level=log.ERROR)
        item = response.meta['item']
        item['desc'] = "test4"
        return item

I followed the documentation, but it only works for one iteration of the loop. The only output I see in the log is:

LOOP1
LOOP2

It should be repeated 3 times.

I tried different combinations of return and yield:

  1. return request and return item gives the output LOOP1 LOOP2
  2. yield request and return item gives the output LOOP1 LOOP1 LOOP1 LOOP2
  3. yield request and yield item gives the output LOOP1 LOOP1 LOOP1 LOOP2
  4. return request and yield item gives the output LOOP1 LOOP2

How can I get LOOP1 LOOP2 LOOP1 LOOP2 and so on?

user19140477031 asked Apr 21 '26 16:04



1 Answer

The problem is in your loop:

for site in sites[:5]:

Every iteration requests the same URL.

By default, Scrapy filters out duplicate requests and ignores them.

If you want to request the same URL more than once, you need to set dont_filter=True on the request:

            request = Request(
                "http://www.sitepoint.com/getting-to-know-css3-selectors-structural-pseudo-classes/",
                dont_filter=True,
                callback=self.test1,
            )

With that change, the callback should run for every request, so the output should be repeated 3 times.

akhter wahab answered Apr 23 '26 07:04



