I am trying to accomplish a task using R to scrape data on a website.
I would like to go through each link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
Select only items with Current Status showing "transmitted to the governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
And then scrapping the cells within STATUS TEXT for the following clause" Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).
I have tried using previous examples with packages Rcurl and XML (in R), but I don't know how to use them correctly for aspx sites. So what I would love to have is: 1. Some suggestion on how to build such a code. 2. And recommendation for how to learn the knowledge needed for performing such a task.
Thanks for any help,
Tom
The commonly used web Scraping tools for R is rvest. Install the package rvest in your R Studio using the following code. Having, knowledge of HTML and CSS will be an added advantage. It's observed that most of the Data Scientists are not very familiar with technical knowledge of HTML and CSS.
Web scraping in R is all about finding, extracting, and formatting data for later analysis. Because of R's built-in tools and libraries, web scraping in R is both easy and scalable. That's why it should be no surprise that it's one of the most popular programming languages in the data science community.
require(httr)
require(XML)
basePage <- "http://capitol.hawaii.gov"
h <- handle(basePage)
GET(handle = h)
res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")
# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <-sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')
appUrls <- resUrls[include]
# look at just the first
res <- GET(handle = h, path = appUrls[1])
resXML <- htmlParse(content(res, as = "text"))
xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
Tokioka voting no (4) and none excused (0)."
Let package httr
handle all the background work by setting up a handle
.
If you want to run over all 92 links:
# get all the links returned as a list (will take sometime)
# print statement included for sanity
res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x)));
GET(handle = h, path = x)})
resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))})
appString <- sapply(resXML, function(x){
xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
})
head(appString)
> head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."
$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none."
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With