I am trying to accomplish a task using R to scrape data on a website. <ol> <li>I would like to go through each link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills</li> <li>Select only items with Current Status showing "transmitted to the governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013</li> <li>And then scrapping the cells within STATUS TEXT for the following clause" Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0). </li> </ol> I have tried using previous examples with packages Rcurl and XML (in R), but I don't know how to use them correctly for aspx sites. So what I would love to have is: 1. Some suggestion on how to build such a code. 2. And recommendation for how to learn the knowledge needed for performing such a task. Thanks for any help, Tom

<pre class="prettyprint"><code>require(httr) require(XML) basePage <- "http://capitol.hawaii.gov" h <- handle(basePage) GET(handle = h) res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House") # parse content for "Transmitted to Governor" text resXML <- htmlParse(content(res, as = "text")) resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]') appRows <-sapply(resTable, xmlValue) include <- grepl("Transmitted to Governor", appRows) resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href') appUrls <- resUrls[include] # look at just the first res <- GET(handle = h, path = appUrls[1]) resXML <- htmlParse(content(res, as = "text")) xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue) [1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)." </code></pre> Let package <code>httr</code> handle all the background work by setting up a <code>handle</code>. If you want to run over all 92 links: <pre class="prettyprint"><code> # get all the links returned as a list (will take sometime) # print statement included for sanity res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x))); GET(handle = h, path = x)}) resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))}) appString <- sapply(resXML, function(x){ xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue) }) head(appString) > head(appString) $href [1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)." $href [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." [2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)." $href [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)." $href [1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 1 Excused: Ige." [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)." $href [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." [2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)." $href [1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none. 0 Excused: none." [2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)." </code></pre>

Scraping from aspx website using R

Tags:

r

web-scraping

I am trying to accomplish a task using R to scrape data on a website.

I would like to go through each link on the following page: http://capitol.hawaii.gov/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House Bills
Select only items with Current Status showing "transmitted to the governor". For example, http://capitol.hawaii.gov/measure_indiv.aspx?billtype=HB&billnumber=17&year=2013
And then scrapping the cells within STATUS TEXT for the following clause" Passed Final Reading". For example: Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0).

I have tried using previous examples with packages Rcurl and XML (in R), but I don't know how to use them correctly for aspx sites. So what I would love to have is: 1. Some suggestion on how to build such a code. 2. And recommendation for how to learn the knowledge needed for performing such a task.

Thanks for any help,

Tom

266

asked May 30 '13 01:05

yokota

1 Answers

require(httr)
require(XML)

basePage <- "http://capitol.hawaii.gov"

h <- handle(basePage)

GET(handle = h)

res <- GET(handle = h, path = "/advreports/advreport.aspx?year=2013&report=deadline&rpt_type=&measuretype=hb&title=House")

# parse content for "Transmitted to Governor" text
resXML <- htmlParse(content(res, as = "text"))
resTable <- getNodeSet(resXML, '//*/table[@id ="GridViewReports"]/tr/td[3]')
appRows <-sapply(resTable, xmlValue)
include <- grepl("Transmitted to Governor", appRows)
resUrls <- xpathSApply(resXML, '//*/table[@id ="GridViewReports"]/tr/td[2]//@href')

appUrls <- resUrls[include]

# look at just the first

res <- GET(handle = h, path = appUrls[1])

resXML <- htmlParse(content(res, as = "text"))


xpathSApply(resXML, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)

[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan,
 Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro,
 Tokioka voting no (4) and none excused (0)."

Let package httr handle all the background work by setting up a handle.

If you want to run over all 92 links:

 # get all the links returned as a list (will take sometime)
 # print statement included for sanity
 res <- lapply(appUrls, function(x){print(sprintf("Got url no. %d",which(appUrls%in%x)));
                                   GET(handle = h, path = x)})
 resXML <- lapply(res, function(x){htmlParse(content(x, as = "text"))})
 appString <- sapply(resXML, function(x){
                   xpathSApply(x, '//*[text()[contains(.,"Passed Final Reading")]]', xmlValue)
                      })


 head(appString)

>  head(appString)
$href
[1] "Passed Final Reading as amended in SD 2 with Representative(s) Fale, Jordan, Tsuji voting aye with reservations; Representative(s) Cabanilla, Morikawa, Oshiro, Tokioka voting no (4) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                                  
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Cullen, Har voting aye with reservations; Representative(s) McDermott voting no (1) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                                 
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; Representative(s) Hashem, McDermott voting no (2) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 24 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  1 Excused: Ige."                    
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and Representative(s) Say excused (1)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."                        
[2] "Passed Final Reading as amended in CD 1 with Representative(s) Johanson voting aye with reservations; none voting no (0) and none excused (0)."

$href
[1] "Passed Final Reading, as amended (CD 1). 25 Aye(s); Aye(s) with reservations: none . 0 No(es): none.  0 Excused: none."  
[2] "Passed Final Reading as amended in CD 1 with none voting aye with reservations; none voting no (0) and none excused (0)."

118

answered Nov 12 '22 16:11

user1609452

Related questions
                            
                                glm and relative risk -replicate Stata code in R
                            
                                Creating new named variable in dataframe using loop and naming convention
                            
                                R-project filepath from concatenation
                            
                                Inverse function of an unknown cumulative function
                            
                                freeing up memory in R
                            
                                Understanding 'gslider' function to make interactive plots
                            
                                access plyr id variables within functions
                            
                                R - How to apply different functions to certain rows in a column
                            
                                An "asymmetric" pairwise distance matrix
                            
                                Simulating time series random variable in R?
                            
                                How to use R base plots in grid.newpage?
                            
                                How can I color nodes of a graphNEL graph?
                            
                                Different data in upper and lower panel of scatterplot matrix
                            
                                Store large dataframes in redis through R
                            
                                understanding optimisation messages on assignment by reference in a data.table
                            
                                How to generate regular xts periods from random time observations?
                            
                                Removing Vertical White Lines in ggplot Line Graph
                            
                                C code called by R keeps crashing
                            
                                .SD columns in data.table in R
                            
                                Why lag in r is not working for matrix?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With