Scrapy

Question

I was wondering if anyone ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...

I am using the following rule:


   rules = (
       Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
           follow=True,
           callback='parse_article'),
       )

(having in mind that rss links are located in the link tag).

I am not sure how to tell SgmlLinkExtractor to extract the text() of the link and not to search the attributes ...

Any help is welcome, Thanks in advance

Pablo Hoffman · Accepted Answer

CrawlSpider rules don't work that way. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example:

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

class MySpider(BaseSpider):
    name = 'myspider'

    def parse(self, response):
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        return [Request(x, callback=self.parse_link) for x in links]

You can also try the XPath in the shell, by running for example:

scrapy shell http://blog.scrapy.org/rss.xml

And then typing in the shell:

>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
 u'http://blog.scrapy.org/new-bugfix-release-0101',
 u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']

opyate · Answer

There's an XMLFeedSpider one can use nowadays.

Scrapy - Follow RSS links

Tags:

python

web-crawler

pour toi

2 Answers

Pablo Hoffman

opyate

Recent Activity

Donate For Us