I've managed to code a very simple crawler with Scrapy, given these constraints. It runs well, except that the rules are not applied if I add a callback to the first request!
Here is my code (it works, but not properly, against a live example):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapySpider.items import SPage
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider4(CrawlSpider):
    name = "spiderSO"
    allowed_domains = ["cumulodata.com"]
    start_urls = ["http://www.cumulodata.com"]
    extractor = SgmlLinkExtractor()

    def parse_start_url(self, response):
        # 3
        print('----------manual call of', response)
        self.parse_links(response)
        print('----------manual call done')
        # 1 return Request(self.start_urls[0])  # does not call parse_links(example.com)
        # 2 return Request(self.start_urls[0], callback=self.parse_links)  # does not call parse_links(example.com)

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        print('----------- manual parsing links of', response.url)
        links = hxs.select('//a')
        for link in links:
            title = link.select('@title')
            url = link.select('@href').extract()[0]
            meta = {'title': title}
            yield Request(url, callback=self.parse_page, meta=meta)

    def parse_page(self, response):
        print('----------- parsing page: ', response.url)
        hxs = HtmlXPathSelector(response)
        item = SPage()
        item['url'] = str(response.request.url)
        item['title'] = response.meta['title']
        item['h1'] = hxs.select('//h1/text()').extract()
        yield item
I've tried solving this issue in 3 ways:
1. Returning Request(self.start_urls[0]) from parse_start_url (# 1 in the code) - parse_links is not called for the start URL.
2. Returning the same Request with callback=self.parse_links (# 2) - same issue, parse_links is not called.
3. Calling parse_links manually after scraping the start URL, by implementing parse_start_url (# 3) - the function does not actually get called.
Here are the logs:
----------manual call of <200 http://www.cumulodata.com>)
----------manual call done
#No '----------- manual parsing links', so `parse_links` is never called!
Versions
Here's a scraper that works perfectly:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from scrapySpider.items import SPage
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class TestSpider4(CrawlSpider):
    name = "spiderSO"
    allowed_domains = ["cumulodata.com"]
    start_urls = ["http://www.cumulodata.com/"]
    extractor = SgmlLinkExtractor()

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_start_url(self, response):
        list(self.parse_links(response))

    def parse_links(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            title = ''.join(link.select('./@title').extract())
            url = ''.join(link.select('./@href').extract())
            meta = {'title': title}
            cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url
            yield Request(cleaned_url, callback=self.parse_page, meta=meta)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        item = SPage()
        item['url'] = response.url
        item['title'] = response.meta['title']
        item['h1'] = hxs.select('//h1/text()').extract()
        return item
Changes:
1. Implemented parse_start_url - Unfortunately, when you specify a callback for the first request, the rules are not executed. This is built into Scrapy, and we can only manage it with a workaround, so we do a list(self.parse_links(response)) inside this function. Why the list()? Because parse_links is a generator, and generators are lazy, so we need to explicitly run it to completion (see the first sketch after this list).
2. cleaned_url = "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url - There are a couple of things going on here (see the second sketch after this list for a worked example):
a. We're adding '?1' (or '/?1' for the bare domain) to the end of the URL - Since parse_links returns duplicate URLs, Scrapy filters them out. An easier way to avoid that is to pass dont_filter=True to Request(). However, all your pages are interlinked (back to the index from pageAA, etc.), and dont_filter here results in too many duplicate requests and items.
b. if not '/' in url.partition('//')[2] - Again, this is because of the linking on your website. One of the internal links points to 'www.cumulodata.com' and another to 'www.cumulodata.com/'. Since we're explicitly adding a mechanism to allow duplicates, that was resulting in one extra item, and since we needed an exact result, I implemented this hack.
3. title = ''.join(link.select('./@title').extract()) - You don't want to return the node, but its data. Also, ''.join(list) is better than list[0] in case of an empty list (see the third sketch after this list).
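First sketch - a minimal, standalone illustration (plain Python, hypothetical names, no Scrapy involved) of why the list() call is needed: calling a generator function does not run its body until the generator is consumed.

def make_requests():
    # Nothing below runs when make_requests() is called;
    # it only runs as the returned generator is iterated.
    print('building request 1')
    yield 'request 1'
    print('building request 2')
    yield 'request 2'

gen = make_requests()   # no output yet - the generator is lazy
results = list(gen)     # forces full iteration; both print lines appear now
print(results)          # ['request 1', 'request 2']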
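Second sketch - a standalone worked example of the cleaned_url expression (the clean() helper name is just for illustration), showing how it normalizes the bare domain and appends the '?1' query string:

def clean(url):
    # url.partition('//')[2] is everything after the scheme's '//';
    # if it contains no '/', the URL has no path, so append '/?1',
    # otherwise just append '?1'.
    return "%s/?1" % url if not '/' in url.partition('//')[2] else "%s?1" % url

print(clean('http://www.cumulodata.com'))         # http://www.cumulodata.com/?1
print(clean('http://www.cumulodata.com/'))        # http://www.cumulodata.com/?1 (same as above)
print(clean('http://www.cumulodata.com/pageAA'))  # http://www.cumulodata.com/pageAA?1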
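Third sketch - a tiny standalone example of why ''.join(...) is safer than indexing the list returned by extract() (the values here are made up, not actual crawl output):

extracted = []                  # e.g. a link with no title attribute
print(''.join(extracted))       # '' - empty string, no exception
# print(extracted[0])           # would raise IndexError: list index out of range

extracted = ['Page AA']
print(''.join(extracted))       # Page AA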
Congrats on creating a test website which posed a curious problem - duplicates are both necessary and unwanted!