
Scraping html table and its href Links in R

Tags: html, r, xpath, rvest

I am trying to download a table that contains text and links. I can successfully download the table with the link text "Pass". However, instead of the text, I would like to capture the actual href URL.

library(dplyr)
library(rvest)
library(XML)
library(httr)
library(stringr)

link <- "http://www.qimedical.com/resources/method-suitability/"

qi_webpage <- read_html(link)

qi_table <- html_nodes(qi_webpage, 'table')
qi <- html_table(qi_table, header = TRUE)[[1]]
qi <- qi[,-1]

The code above gives a nice data frame. However, the last column contains only the text "Pass", whereas I would like it to hold the link associated with that text. I tried the following to add the links, but they do not line up with the correct rows:

qi_get <- GET("http://www.qimedical.com/resources/method-suitability/")
qi_html <- htmlParse(content(qi_get, as="text"))

qi.urls <- xpathSApply(qi_html, "//*/td[7]/a", xmlAttrs, "href")
qi.urls <- qi.urls[1,]

qi <- mutate(qi, "MSTLink" = (ifelse(qi$`Study Protocol(click to download certification)` == "Pass", (t(qi.urls)), "")))

I know little about HTML, CSS, etc., so I am not sure what I am missing to accomplish this properly.

Thanks!!

Alex Dometrius asked Dec 19 '22 07:12



1 Answer

You're looking for <a> elements inside table cells (<td>), and then you want the value of each element's href attribute. Here's one way, which returns a character vector with all the URLs for the PDF downloads:

qi_webpage %>%
  html_nodes(xpath = "//td/a") %>% 
  html_attr("href")
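
The vector above only contains entries for cells that actually have a link, so it can drift out of step with the table rows, which is the misalignment described in the question. One possible way around that is to query row by row. The following is a minimal sketch, not a tested solution for the real page: it uses a small inline HTML snippet as a stand-in, and the sample table, its three-column layout, and the names sample_html, page, rows, links and tbl are all assumptions made for illustration. On the real page the question's XPath suggests the link sits in the seventh cell, so td[3] would presumably become td[7].

```r
library(rvest)

# Hypothetical stand-in for the real page: the link lives in the last column,
# and the second data row deliberately has no link.
sample_html <- '
<table>
  <tr><th>Product</th><th>Result</th><th>Protocol</th></tr>
  <tr><td>A</td><td>ok</td><td><a href="a.pdf">Pass</a></td></tr>
  <tr><td>B</td><td>ok</td><td></td></tr>
  <tr><td>C</td><td>ok</td><td><a href="c.pdf">Pass</a></td></tr>
</table>'

page <- read_html(sample_html)

# One node per data row: only rows that contain <td> cells, so the header
# row (which has <th> cells) is skipped.
rows <- html_nodes(page, xpath = "//table//tr[td]")

# html_node() (singular) yields exactly one result per row, missing where the
# row has no link, so html_attr() gives NA there and the vector stays aligned.
links <- rows %>%
  html_node(xpath = "td[3]/a") %>%
  html_attr("href")

# Attach the links to the parsed table as a new column.
tbl <- html_table(html_nodes(page, "table"), header = TRUE)[[1]]
tbl$MSTLink <- links
```

Because links has one element per data row, it can be assigned directly as a column without the ifelse() matching on "Pass". If the hrefs on the real page turn out to be relative, they would still need to be resolved against the page URL.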
neilfws answered Jan 01 '23 09:01
