Hello, I'm new to using R to scrape data from the Internet and, sadly, know little about HTML and XML. I'm trying to scrape each story link on the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with one row per story and columns for the story URL, the title of the story, the date (it's always at the beginning of the first sentence following the story title), and the rest of the text of the page (which can be several paragraphs).
I've tried to adapt the code at Scraping a wiki page for the "Periodic table" and all the links (and several related threads) but have run into difficulties. Any advice or pointers would be gratefully appreciated. Here's what I've tried so far (with "?????" where I run into trouble):
rm(list=ls())
library(XML)
library(plyr)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
links = getNodeSet(doc, ?????)
df = ldply(doc, function(x) {
text = xmlValue(x)
if (text=='') text=NULL
symbol = xmlGetAttr(x, '?????')
link = xmlGetAttr(x, 'href')
if (!is.null(text) & !is.null(symbol) & !is.null(link))
data.frame(symbol, text, link)
} )
df = head(df, ?????)
You can use xpathSApply (an sapply-style function for XML documents), which searches your document for the nodes matching a given XPath expression and applies a function to each of them.
library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
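# store the result as dat so it can be extended with the story text later
dat <- data.frame(
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)
head(dat, 4)  # show the first few rows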
##          dates                                    hrefs
## 1 26 June 2013 /entity/csr/don/2013_06_26/en/index.html
## 2 23 June 2013 /entity/csr/don/2013_06_23/en/index.html
## 3 22 June 2013 /entity/csr/don/2013_06_22/en/index.html
## 4 17 June 2013 /entity/csr/don/2013_06_17/en/index.html
##                                                              story
## 1 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4 Middle East respiratory syndrome coronavirus (MERS-CoV) - update
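# fetch each story page and keep the text of its main content block
dat$text <- unlist(lapply(as.character(dat$hrefs), function(x) {
  # the hrefs are site-relative, so turn them into absolute URLs
  url.story <- sub('^/entity', 'http://www.who.int', x)
  texts <- xpathSApply(htmlParse(url.story),
                       '//*[@id="primary"]', xmlValue)
  # collapse to a single string so there is exactly one value per story
  paste(texts, collapse = '\n')
}))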
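If it helps to see what the XPath calls are doing without hitting the site, here is a small offline sketch. The HTML fragment is made up to imitate the structure the expressions above rely on (it is an assumption, not a copy of the real page), and the `url` column is an extra added purely for illustration.

```r
library(XML)

# Hypothetical fragment imitating the archive page's structure (not the real page)
html <- '<html><body><ul class="auto_archive">
  <li><a href="/entity/csr/don/2013_06_26/en/index.html">26 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
  <li><a href="/entity/csr/don/2013_06_23/en/index.html">23 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
</ul></body></html>'

# asText = TRUE tells htmlParse the argument is HTML content, not a file or URL
doc <- htmlParse(html, asText = TRUE)

# xmlValue returns the text of each matched node;
# extra arguments (here 'href') are passed on to xmlGetAttr
dates <- xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue)
hrefs <- xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href')
story <- xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue)

dat <- data.frame(dates, hrefs, story, stringsAsFactors = FALSE)

# illustrative extra column: absolute URL built from the site-relative href
dat$url <- sub('^/entity', 'http://www.who.int', dat$hrefs)
print(dat)
```

Each xpathSApply call returns one value per matched node, so the three vectors line up row by row as long as every list item has both a link and a link_info element. If the real page ever breaks that pairing, data.frame will complain about differing lengths, which is a useful signal to look at the markup again.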