I am trying to scrape and parse the following RSS feed: http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml. I have looked at other questions about R and XML but have been unable to make any progress on my problem. The XML for each entry looks like this:
<item>
<title><![CDATA[Five Rockets Intercepted By Iron Drone Systems Over Be'er Sheva]]></title>
<link>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</link>
<description><![CDATA[<a href="http://www.haaretz.com/news/diplomacy-defense/live-blog-rockets-strike-tel-aviv-area-three-israelis-killed-in-attack-on-south-1.477960" target="_hplink">Haaretz reports</a> that five more rockets intercepted by Iron Dome systems over Be'er Sheva. In total, there have been 274 rockets fired and 105 intercepted. The IDF has attacked 250 targets in Gaza.]]></description>
<guid>http://www.huffingtonpost.co.uk/2012/11/15/tel-aviv-gaza-rocket_n_2138159.html#2_five-rockets-intercepted-by-iron-drone-systems-over-beer-sheva</guid>
<pubDate>2012-11-15T12:56:09-05:00</pubDate>
<source url="http://huffingtonpost.com/rss/liveblog/liveblog-1213.xml">Huffingtonpost.com</source>
</item>
For each entry/post I want to record "Date" (pubDate), "Title" (title), and "Description" (full text, cleaned). I have tried to use the XML package in R, but I confess I am a bit of a newbie (little to no experience working with XML, but some R experience). The code I am working from, and getting nowhere with, is:
library(XML)
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
# Use the xmlTreeParse function to parse the XML file directly from the web
xmlfile <- xmlTreeParse(xml.url)
# Use the xmlRoot-function to access the top node
xmltop = xmlRoot(xmlfile)
xmlName(xmltop)
names( xmltop[[ 1 ]] )
title link description language copyright
"title" "link" "description" "language" "copyright"
category generator docs item item
"category" "generator" "docs" "item" "item"
However, whenever I try to manipulate the "title" or "description" information, I continually get errors. Any help troubleshooting this code would be most appreciated.
Thanks, Thomas
I am using the excellent RCurl library together with xpathSApply from the XML package.
This script gives you three character vectors (titles, pubdates, and descriptions):
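library(RCurl)
library(XML)
xml.url <- "http://www.huffingtonpost.com/rss/liveblog/liveblog-1213.xml"
# Download the raw feed as a single character string
script <- getURL(xml.url)
# Parse it into an internal document so XPath queries can be run against it
doc <- xmlParse(script)
# Extract the text of each field from every <item>
titles <- xpathSApply(doc,'//item/title',xmlValue)
descriptions <- xpathSApply(doc,'//item/description',xmlValue)
pubdates <- xpathSApply(doc,'//item/pubDate',xmlValue)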
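Since the question asks for one record per post with a cleaned description, here is a minimal sketch of how the three vectors might be combined into a data frame with the HTML stripped out of the description. To keep it runnable without network access it parses a small inline feed that I made up to mirror the <item> layout above; the feed text, the strip_html helper, and the posts data frame are my own assumptions, not part of the original feed or of the XML package.

```r
library(XML)

# Hypothetical one-item feed that mirrors the <item> structure in the question
feed <- '<rss version="2.0"><channel>
<item>
<title><![CDATA[Example title]]></title>
<description><![CDATA[<a href="http://example.com/">Example source</a> reports an example.]]></description>
<pubDate>2012-11-15T12:56:09-05:00</pubDate>
</item>
</channel></rss>'

doc <- xmlParse(feed, asText = TRUE)

# strip_html is a hypothetical helper: parse the fragment as HTML and
# keep only its text nodes, dropping tags such as <a>
strip_html <- function(x) {
  html <- htmlParse(x, asText = TRUE)
  trimws(paste(xpathSApply(html, "//body//text()", xmlValue), collapse = ""))
}

raw_descriptions <- xpathSApply(doc, "//item/description", xmlValue)

# One row per <item>; Date is left as a character string here
posts <- data.frame(
  Date = xpathSApply(doc, "//item/pubDate", xmlValue),
  Title = xpathSApply(doc, "//item/title", xmlValue),
  Description = vapply(raw_descriptions, strip_html, character(1),
                       USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

print(posts)
```

With the real feed you would replace the inline string with the result of getURL(xml.url) as in the script above. Note that this sketch assumes every description is non-empty; an empty description would probably need to be handled before calling htmlParse.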