
R - How to extract items from XML Nodeset?

I have a list of 438 pitcher names that look like this (in XML Nodeset):

> pitcherlinks[[1]]
<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
  <a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
</td> 

> pitcherlinks[[2]]
<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
  <a href="/players/a/adlemti01.shtml">Tim Adleman</a>
</td> 

How do I extract the names, like Fernando Abad, and the associated links, like /players/a/abadfe01.shtml?

IRNotSmart asked Dec 29 '25 05:12


1 Answer

Since you have a list, use an apply function to walk through it. For each element, read_html parses the HTML fragment and html_nodes with the CSS selector a finds the anchors (links). The name comes from html_text, and the link is in the href attribute, read with html_attr.

library(rvest)
pitcherlinks <- list()
pitcherlinks[[1]] <- 
'<td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01">
  <a href="/players/a/abadfe01.shtml">Fernando Abad</a>*
    </td>'

pitcherlinks[[2]] <- 
  '<td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01">
    <a href="/players/a/adlemti01.shtml">Tim Adleman</a>
      </td>'

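# For each fragment: parse it, select the <a> nodes, and take the anchor text (the name)
names <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_text()})
# Same traversal, but take the href attribute (the link)
links <- sapply(pitcherlinks, function(x) {x %>% read_html() %>% html_nodes("a") %>% html_attr("href")})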

names
# [1] "Fernando Abad" "Tim Adleman"  
links
# [1] "/players/a/abadfe01.shtml"  "/players/a/adlemti01.shtml"
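
If pitcherlinks is still an actual xml_nodeset (as printed in the question) rather than a list of strings, re-parsing each element may not be needed, since html_nodes, html_text and html_attr also accept a nodeset. A minimal sketch, assuming the cells were selected from a parsed page; the small table, the td[data-stat='player'] selector, and the pitcher_names / pitcher_links variable names are made up here for illustration and are not taken from the real page:

```r
library(rvest)

# Hypothetical stand-in for the scraped page: a tiny table holding the two
# cells from the question. On the real page, pitcherlinks would come from
# whatever html_nodes() call produced it.
page <- read_html('<table><tr>
  <td class="left " data-append-csv="abadfe01" data-stat="player" csk="Abad,Fernando0.01"><a href="/players/a/abadfe01.shtml">Fernando Abad</a>*</td>
  <td class="left " data-append-csv="adlemti01" data-stat="player" csk="Adleman,Tim0.01"><a href="/players/a/adlemti01.shtml">Tim Adleman</a></td>
</tr></table>')

# An xml_nodeset of <td> cells (assumed selector, for illustration only)
pitcherlinks <- html_nodes(page, "td[data-stat='player']")

# Select the anchors inside all cells at once, then read text and href
anchors <- html_nodes(pitcherlinks, "a")
pitcher_names <- html_text(anchors)
pitcher_links <- html_attr(anchors, "href")
```

Here pitcher_names and pitcher_links are plain character vectors with one entry per anchor, and the names avoid shadowing base R's names() function.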

Andrew Lavers answered Dec 30 '25 21:12