
Scraping a web page, links on a page, and forming a table with R

Hello, I'm new to using R to scrape data from the Internet and, sadly, know little about HTML and XML. I am trying to scrape each story link on the following parent page: http://www.who.int/csr/don/archive/year/2013/en/index.html. I don't care about any of the other links on the parent page, but I need to create a table with one row per story and columns for the story's URL, its title, its date (always at the beginning of the first sentence following the story title), and the rest of the page text (which can be several paragraphs).

I've tried to adapt the code from the thread "Scraping a wiki page for the 'Periodic table' and all the links" (and several related threads) but have run into difficulties. Any advice or pointers would be greatly appreciated. Here's what I've tried so far (with "?????" where I run into trouble):

rm(list=ls())
library(XML)
library(plyr) 

url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)

links = getNodeSet(doc, ?????)

df = ldply(doc, function(x) {
  text = xmlValue(x)
  if (text=='') text=NULL

  symbol = xmlGetAttr(x, '?????')
  link = xmlGetAttr(x, 'href')
  if (!is.null(text) & !is.null(symbol) & !is.null(link))
    data.frame(symbol, text, link)
} )

df = head(df, ?????)
user2535366 asked Mar 23 '23 08:03



1 Answer

You can use xpathSApply (the XPath counterpart of sapply), which searches your document for the nodes matching a given XPath expression and applies a function to each of them.

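library(XML)
url = 'http://www.who.int/csr/don/archive/year/2013/en/index.html'
doc <- htmlParse(url)
# Store the result as `dat` so the story text can be added to it later
dat <- data.frame(
  # link text (the date) and target of each story link in the archive list
  dates = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  # story title shown next to each link
  story = xpathSApply(doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)
head(dat, 4)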

##               dates                                                hrefs
## 1      26 June 2013             /entity/csr/don/2013_06_26/en/index.html
## 2      23 June 2013             /entity/csr/don/2013_06_23/en/index.html
## 3      22 June 2013             /entity/csr/don/2013_06_22/en/index.html
## 4      17 June 2013             /entity/csr/don/2013_06_17/en/index.html

##                                                                                    story
## 1                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 2                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 3                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
## 4                       Middle East respiratory syndrome coronavirus (MERS-CoV) - update
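
To see how these XPath expressions pick out nodes without hitting the network, here is a minimal offline sketch. The HTML fragment is a made-up stand-in that only assumes the structure implied by the expressions above (a list with class auto_archive whose items hold a link, plus an element with class link_info holding the title); the real page may well differ.

```r
library(XML)

# Hypothetical fragment mimicking the structure the XPath expressions assume
html <- '
<ul class="auto_archive">
  <li><a href="/entity/csr/don/2013_06_26/en/index.html">26 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
  <li><a href="/entity/csr/don/2013_06_23/en/index.html">23 June 2013</a>
      <span class="link_info">MERS-CoV - update</span></li>
</ul>'

# asText = TRUE makes htmlParse treat the string as content, not a file or URL
demo_doc <- htmlParse(html, asText = TRUE)

demo <- data.frame(
  dates = xpathSApply(demo_doc, '//*[@class="auto_archive"]/li/a', xmlValue),
  hrefs = xpathSApply(demo_doc, '//*[@class="auto_archive"]/li/a', xmlGetAttr, 'href'),
  story = xpathSApply(demo_doc, '//*[@class="link_info"]/text()', xmlValue),
  stringsAsFactors = FALSE)

print(demo)
```

Each xpathSApply call returns one value per matched node, so the three columns line up only if every list item has exactly one link and one title.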

EDIT: add the text of each story

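# Fetch each story page and extract the text of its main content block
dat$text <- vapply(as.character(dat$hrefs), function(x) {
  # Turn the relative href into an absolute URL
  url.story <- sub('^/entity', 'http://www.who.int', x)
  texts <- xpathSApply(htmlParse(url.story),
                       '//*[@id="primary"]', xmlValue)
  # Collapse to one string so each story yields exactly one value
  paste(texts, collapse = '\n')
}, character(1), USE.NAMES = FALSE)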
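
The question also asks for a usable date and a full URL per story. One possible post-processing sketch follows, assuming the dates column has the "26 June 2013" form shown in the output above. The helper name tidy_index and the sample rows are hypothetical, and because month-name parsing depends on the locale, LC_TIME is temporarily set to "C" here.

```r
# Hypothetical helper: add absolute URLs and parsed dates to the index table
tidy_index <- function(dat, base = 'http://www.who.int') {
  # Month names are parsed according to LC_TIME; "C" gives English names
  old <- Sys.getlocale('LC_TIME')
  Sys.setlocale('LC_TIME', 'C')
  on.exit(Sys.setlocale('LC_TIME', old))

  # Same '/entity' substitution as above, anchored to the start of the path
  dat$url  <- sub('^/entity', base, dat$hrefs)
  dat$date <- as.Date(dat$dates, format = '%d %B %Y')
  dat
}

# Sample rows copied from the output shown above
sample_dat <- data.frame(
  dates = c('26 June 2013', '23 June 2013'),
  hrefs = c('/entity/csr/don/2013_06_26/en/index.html',
            '/entity/csr/don/2013_06_23/en/index.html'),
  stringsAsFactors = FALSE)

tidied <- tidy_index(sample_dat)
print(tidied[, c('date', 'url')])
```

With a real Date column the table can be sorted or filtered by date, e.g. tidied[order(tidied$date), ].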
agstudy answered Apr 06 '23 02:04
