Removing a tag within readHTMLTable in the XML package

Tags:

r

I'm trying to scrape data from the table at the following url:

http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033

The problem is the superscripts contained within the <sup> </sup> tags. When I use the following code (admittedly not very elegant):

library(XML)

url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
overview <- readHTMLTable(url.overview)  # read every table on the page
overview <- overview[[2]]                # keep the second table
overview <- overview[-1,]                # drop the first row

f <- function(x){
  out <- iconv(x, "latin1", "ASCII", sub="")
  out <- gsub('[\\$,]', '', out) 
  out <- as.numeric(out)
  return(out)
}

overview <- matrix(f(as.character(unlist(overview))), ncol = ncol(overview))
overview <- as.data.frame(overview)
names(overview) <- c('year', 'fires', 'civ.deaths', 'civ.injuries', 'ff.deaths',
                     'ff.injuries', 'damage.reported', 'damage.2010dollars')

I get exactly what I want except that the values in the superscripts are appended to the end of the values in the table cells. For example, (using the row and column names from the url given above) Civilian Deaths in 2001 are stored as 61963 when they should be 6196 since the superscript 3 is interpreted as an extra digit. Any cells in the table that lack a superscript come out exactly correct.

After many hours struggling through the documentation, I was able to use the functions htmlParse and getNodeSet from the XML package to identify all of the nodes containing the <sup> tags, but couldn't figure out what to do from there:

overview <- htmlParse(url.overview)
getNodeSet(overview, "//sup")

I take it I somehow need to remove these parts of the XML tree, then pass the result back to readHTMLTable for further processing but I couldn't figure out how to do this.

I'd be very grateful for your thoughts.

inhuretnakht asked Aug 21 '12 22:08

1 Answer

Try

library(XML)

url.overview <- "http://www.nfpa.org/itemDetail.asp?categoryID=953&itemID=23033"
overview <- htmlParse(url.overview, encoding = "UTF-8")
# Select the <sup> nodes that sit inside <span class="small"> elements
temp <- getNodeSet(overview, "/*//span[@class=\"small\"]/sup")
removeNodes(temp)  # modifies the parsed document in place
app.data <- readHTMLTable(overview)[[2]]

Here we simply remove the nodes we don't want and feed the remainder back into readHTMLTable, taking the second table. I was having encoding issues on this Windows box, so you may want to keep the encoding argument in htmlParse, though it might work fine without it for you.
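Since the original page may no longer be reachable, here is a minimal offline sketch of the same idea. The HTML string is a made-up stand-in for the NFPA table (its layout and the 2002 value are assumptions for illustration, not the real data), and it uses the broader "//sup" XPath from the question rather than the span-specific one.

```r
library(XML)

# Hypothetical stand-in for the page: one table whose 2001 cell carries
# a footnote marker in a <sup> tag (values are illustrative only).
html <- '<html><body><table>
<tr><th>Year</th><th>Civilian deaths</th></tr>
<tr><td>2001</td><td>6,196<sup>3</sup></td></tr>
<tr><td>2002</td><td>3,380</td></tr>
</table></body></html>'

doc <- htmlParse(html, asText = TRUE)

# Drop every <sup> node from the parsed tree; the document is modified in place.
removeNodes(getNodeSet(doc, "//sup"))

# Read the cleaned table, then strip dollar signs and commas before converting.
tab <- readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
deaths <- as.numeric(gsub("[$,]", "", tab[[2]]))
```

Without the removeNodes step, the footnote digit would be glued onto the 2001 value, which is the problem described in the question.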

shhhhimhuntingrabbits answered Sep 18 '22 22:09
