I have this code in my crawler
class StackSpider(InitSpider):
    name = 'stack'
    allowed_domains = ['sitepoint.com']
    start_urls = ["http://www.sitepoint.com"]
    start_page = "http://www.sitepoint.com"
    item = StackItem()

    def init_request(self):
        return Request(url=self.start_page, callback=self.parse)

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="headline_area"]')
        items = []
        ivar = 1
        for site in sites[:5]:
            item = StackItem()
            log.msg(' LOOP' + str(ivar) + '', level=log.ERROR)
            item['title'] = "yoo ma"
            request = Request("http://www.sitepoint.com/getting-to-know-css3-selectors-structural-pseudo-classes/", callback=self.test1)
            request.meta['item'] = item
            ivar = ivar + 1
            yield request

    def test1(self, response):
        log.msg(' LOOP 2 \n', level=log.ERROR)
        item = response.meta['item']
        item['desc'] = "test4"
        return item
I followed the documentation, but it only works for one iteration of the loop. The only output I see in the log on screen is:
LOOP1
LOOP2
It should be repeated 3 times.
I tried different combinations of return and yield:
return request and return item gives output LOOP1 LOOP2
yield request and return item gives output LOOP1 LOOP1 LOOP1 LOOP2
yield request and yield item gives output LOOP1 LOOP1 LOOP1 LOOP2
return request and yield item gives output LOOP1 LOOP2
How can I get LOOP1 LOOP2 LOOP1 LOOP2 and so on?
The problem is in your loop:
for site in sites[:5]:
You are requesting the same URL on every iteration of the loop.
By default, Scrapy filters out duplicate requests and ignores them.
If you want to request the same URL multiple times, you need to set dont_filter=True:
request = Request("http://www.sitepoint.com/getting-to-know-css3-selectors-structural-pseudo-classes/",
                  dont_filter=True,
                  callback=self.test1)
Then it should be repeated 3 times.
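To see why the callback fires only once without dont_filter, here is a small toy model of the filtering step. It is only a sketch, not Scrapy's actual code: ToyRequest, ToyScheduler and count_callbacks are hypothetical names made up for illustration, and the real duplicate filter compares request fingerprints rather than raw URL strings. The idea it tries to show is that the loop in parse runs every time (hence LOOP1 three times), but identical requests after the first are dropped before they reach test1.

```python
# Toy model of duplicate-request filtering. NOT Scrapy's implementation:
# ToyRequest, ToyScheduler and count_callbacks are hypothetical names.
from dataclasses import dataclass


@dataclass
class ToyRequest:
    url: str
    dont_filter: bool = False


class ToyScheduler:
    def __init__(self):
        self.seen = set()   # URLs that have already passed the filter
        self.queue = []     # requests that would actually be downloaded

    def enqueue(self, request):
        # The filter is consulted only when dont_filter is False.
        if not request.dont_filter:
            if request.url in self.seen:
                return False          # duplicate: silently dropped
            self.seen.add(request.url)
        self.queue.append(request)
        return True


def count_callbacks(dont_filter, loops=3):
    """Return how many of `loops` identical requests would reach the callback."""
    scheduler = ToyScheduler()
    url = "http://www.sitepoint.com/getting-to-know-css3-selectors-structural-pseudo-classes/"
    for _ in range(loops):
        scheduler.enqueue(ToyRequest(url, dont_filter=dont_filter))
    return len(scheduler.queue)


if __name__ == "__main__":
    print(count_callbacks(dont_filter=False))  # 1
    print(count_callbacks(dont_filter=True))   # 3
```

In this model, three identical requests with the default setting result in a single callback, while dont_filter=True lets all three through, which matches the LOOP1 LOOP1 LOOP1 LOOP2 output in the question.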