Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to properly use Rules, restrict_xpaths to crawl and parse URLs with scrapy?

I am trying to program a crawl spider to crawl RSS feeds of a website and then parsing the meta tags of the article.

The first RSS page is a page that displays the RSS categories. I managed to extract the link because the tag is in a tag. It looks like this:

        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject1">subject1</a>
           </td>   
        </tr>
        <tr>
           <td class="xmlLink">
             <a href="http://feeds.example.com/subject2">subject2</a>
           </td>
        </tr>

Once you click that link it brings you the the articles for that RSS category that looks like this:

   <li class="regularitem">
    <h4 class="itemtitle">
        <a href="http://example.com/article1">article1</a>
    </h4>
  </li>
  <li class="regularitem">
     <h4 class="itemtitle">
        <a href="http://example.com/article2">article2</a>
     </h4>
  </li>

As You can see I can get the link with xpath again if I use the tag I want my crawler to go to the link inside that tag and parse the meta tags for me.

Here is my crawler code:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import exampleItem


class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
        Rule(SgmlLinkExtractor(restrict_xpaths=('//h4[@class="itemtitle"]')), callback='parse_articles')]

    def parse_articles(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
           item = exampleItem()
           item['link'] = response.url
           item['meta_name'] =m.select('@name').extract()
           item['meta_value'] = m.select('@content').extract()
           items.append(item)
        return items

However this is the output when I run the crawler:

DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject1> (referer: http://example.com/tools/rss)
DEBUG: Crawled (200) <GET http://http://feeds.example.com/subject2> (referer: http://example.com/tools/rss)

What am I doing wrong here? I've been reading the documentation over and over again but I feel like I keep overlooking something. Any help would be appreciated.

EDIT: Added: items.append(item) . Had forgotten it in original post. EDIT: : I've tried this as well and it resulted in the same output:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from reuters.items import exampleItem
from scrapy.http import Request

class MetaCrawl(CrawlSpider):
    name = 'metaspider'
    start_urls = ['http://example.com/tools/rss'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'.*',], restrict_xpaths=('//td[@class="xmlLink"]')), follow=True),
             Rule(SgmlLinkExtractor(allow=[r'.*'], restrict_xpaths=('//h4[@class="itemtitle"]')),follow=True),]


    def parse(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//td[@class="xmlLink"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_link)


    def parse_link(self, response):       
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//h4[@class="itemtitle"]/a/@href')
        for m in meta:
            yield Request(m.extract(), callback = self.parse_again)    

    def parse_again(self, response):
        hxs = HtmlXPathSelector(response)
        meta = hxs.select('//meta')
        items = []
        for m in meta:
            item = exampleItem()
            item['link'] = response.url
            item['meta_name'] = m.select('@name').extract()
            item['meta_value'] = m.select('@content').extract()
            items.append(item)
        return items
like image 297
Marc Avatar asked Mar 03 '13 23:03

Marc


1 Answers

You've returned an empty items, you need to append item to items.
You can also yield item in the loop.

like image 124
kev Avatar answered Oct 27 '22 20:10

kev