
Scrapy - Follow RSS links

I was wondering if anyone has ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...

I am using the following rule:


   rules = (
       Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
           follow=True,
           callback='parse_article'),
       )

(bearing in mind that RSS links are located in the link tag).

I am not sure how to tell SgmlLinkExtractor to extract the text() of the link instead of searching its attributes...

Any help is welcome. Thanks in advance.

asked May 30 '10 by pour toi

2 Answers

CrawlSpider rules don't work that way. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

class MySpider(BaseSpider):
    name = 'myspider'

    def parse(self, response):
        # Parse the feed with an XML selector and pull the text of every <link> node
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        # Schedule a request for each extracted URL
        return [Request(x, callback=self.parse_link) for x in links]

    def parse_link(self, response):
        # Scrape each linked article page here
        pass

You can also try the XPath in the shell by running, for example:

scrapy shell http://blog.scrapy.org/rss.xml

And then typing in the shell:

>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
 u'http://blog.scrapy.org/new-bugfix-release-0101',
 u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']
answered Sep 22 '22 by Pablo Hoffman


There's an XMLFeedSpider one can use nowadays.

answered Sep 26 '22 by opyate