
Parse RSS feed using the XML package in R

Tags: r, xml, xml-parsing

I am trying to scrape and parse the following RSS feed: http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml. I have looked at other questions about R and XML but have been unable to make any progress on my problem. The XML for each entry looks like this:

  <item>
     <title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title>
     <link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link>
     <description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description>
     <guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid>
     <pubDate>2012-11-15T12:56:09-05:00</pubDate>
     <source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source>
  </item>

For each entry/post I want to record the date (pubDate), the title (title), and the description (full text, cleaned). I have tried to use the XML package in R, but I confess I am a bit of a newbie (little to no experience working with XML, but some R experience). The code I am working from, and getting nowhere with, is:

    library(XML)

    xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"

    # Use the xmlTreeParse function to parse the XML file directly from the web
    xmlfile <- xmlTreeParse(xml.url)

    # Use the xmlRoot function to access the top node
    xmltop <- xmlRoot(xmlfile)

    xmlName(xmltop)

    names(xmltop[[1]])

  title          link   description      language     copyright 
  "title"        "link" "description"    "language"   "copyright" 
 category     generator          docs          item          item 
  "category"   "generator"        "docs"        "item"        "item"

However, whenever I try to manipulate the "title" or "description" information, I continually get errors. Any help troubleshooting this code would be most appreciated.

Thanks, Thomas

Thomas asked Nov 20 '12 06:11



1 Answer

I am using the excellent RCurl library and xpathSApply from the XML package.

This script gives you three vectors (titles, descriptions and pubdates):

library(RCurl)
library(XML)
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
script  <- getURL(xml.url)
doc     <- xmlParse(script)
titles    <- xpathSApply(doc,'//item/title',xmlValue)
descriptions    <- xpathSApply(doc,'//item/description',xmlValue)
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue)
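
The question also asks for one record per post with a cleaned description. A minimal sketch of that step, under some assumptions: the inline `rss` string is a made-up stand-in for the feed so the code runs offline, and `strip_tags` and `parse_pubdate` are hypothetical helper names, not part of RCurl or XML. The tag stripping is a crude regular expression and does not decode HTML entities.

```r
library(XML)

# Inline stand-in for the feed (made-up items shaped like the <item> in the
# question), so the sketch runs without network access.
rss <- '<rss version="2.0"><channel>
  <title>Sample feed</title>
  <item>
    <title><![CDATA[First post]]></title>
    <description><![CDATA[<a href="http://example.com">Example</a> reports that <b>something</b> happened.]]></description>
    <pubDate>2012-11-15T12:56:09-05:00</pubDate>
  </item>
  <item>
    <title><![CDATA[Second post]]></title>
    <description><![CDATA[Plain text only.]]></description>
    <pubDate>2012-11-15T13:10:00-05:00</pubDate>
  </item>
</channel></rss>'

doc <- xmlParse(rss, asText = TRUE)

titles       <- xpathSApply(doc, "//item/title", xmlValue)
descriptions <- xpathSApply(doc, "//item/description", xmlValue)
pubdates     <- xpathSApply(doc, "//item/pubDate", xmlValue)

# Hypothetical helper: remove anything that looks like an HTML tag.
# Crude by design; it does not decode entities such as &amp;.
strip_tags <- function(x) {
  trimws(gsub("<[^>]+>", "", x))
}

# Hypothetical helper: rewrite the "-05:00" offset as "-0500" so that
# %z can read it, then return the timestamps in UTC.
parse_pubdate <- function(x) {
  x <- sub("([+-][0-9]{2}):([0-9]{2})$", "\\1\\2", x)
  as.POSIXct(x, format = "%Y-%m-%dT%H:%M:%S%z", tz = "UTC")
}

# One row per <item>: date, title, cleaned description.
feed <- data.frame(
  date        = parse_pubdate(pubdates),
  title       = titles,
  description = strip_tags(descriptions),
  stringsAsFactors = FALSE
)

print(feed)
```

To run this against the real feed, replace the `xmlParse(rss, asText = TRUE)` line with the `getURL` and `xmlParse` calls above; the rest should carry over as long as the feed keeps the same `<item>` layout.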

agstudy answered Oct 06 '22 10:10