I was wondering if anyone ever tried to extract/follow RSS item links using SgmlLinkExtractor/CrawlSpider. I can't get it to work...
I am using the following rule:
rules = ( Rule(SgmlLinkExtractor(tags=('link',), attrs=False), follow=True, callback='parse_article'), )
(having in mind that rss links are located in the link tag).
I am not sure how to tell SgmlLinkExtractor to extract the text() of the link and not to search the attributes ...
Any help is welcome, Thanks in advance
CrawlSpider rules don't work that way. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example:
from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector
class MySpider(BaseSpider):
name = 'myspider'
def parse(self, response):
xxs = XmlXPathSelector(response)
links = xxs.select("//link/text()").extract()
return [Request(x, callback=self.parse_link) for x in links]
You can also try the XPath in the shell, by running for example:
scrapy shell http://blog.scrapy.org/rss.xml
And then typing in the shell:
>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
u'http://blog.scrapy.org/new-bugfix-release-0101',
u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']
There's an XMLFeedSpider
one can use nowadays.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With