
scraping asp javascript paginated tables behind search with R

I'm trying to pull the content from https://www.askebsa.dol.gov/epds/default.asp with either rvest or RSelenium, but I'm not finding guidance for a JavaScript page that begins with a search box. It'd be great to just get all of this content into a simple CSV file.

After that, pulling the data from individual filings like https://www.askebsa.dol.gov/mewaview/View/Index/6219 seems possible, but I'd also appreciate a clean recommendation for how to do that. Thanks.

Anthony Damico asked Aug 10 '18


3 Answers

Here is an example use of RSelenium to get links to the individual filings. The rest should be straightforward once you retrieve the links: you can navigate to these URLs with rvest (as you already did before) and parse the content with string-manipulation tools such as stringr (a rough sketch of that follow-up appears after the example output below). For the second part, it would be optimistic to expect a systematic structure across all forms, so plan to spend some time constructing specific regular expressions to pull what you need from the retrieved text.

The code below may not be the most efficient solution to your problem, but it demonstrates the right RSelenium concepts and ideas. Feel free to tweak it based on your needs.

Additional info: RSelenium: Basics

# devtools::install_github("ropensci/RSelenium")
library(RSelenium)

# launch a remote driver 
driver <- rsDriver(browser=c("chrome"))
remDr <- driver[["client"]]

# select a URL
url <- "https://www.askebsa.dol.gov/epds/default.asp"

# navigate to the URL
remDr$navigate(url)

# choose year - option[2] corresponds to 2017
year <- remDr$findElements(using = 'xpath',  '//*[@id="m1year"]/option[2]')
year[[1]]$clickElement()

# choose company
company <- remDr$findElements(using = 'xpath',  '//*[@id="m1company"]')
company[[1]]$sendKeysToElement(list("Bend"))

# enter ein
ein <- remDr$findElements(using = 'xpath',  '//*[@id="m1ein"]')
ein[[1]]$sendKeysToElement(list("81-6268978"))

# submit the form to get the results
submit <- remDr$findElements(using = 'xpath',  '//*[@id="cmdSubmitM1"]')
submit[[1]]$clickElement()

# get the total number of results
num_of_results <- remDr$findElements(using = 'xpath',  '//*[@id="block-system-main"]/div/div/div/div/div/div[1]/form/table[1]/tbody/tr/td/div/b[1]')
n <- as.integer(num_of_results[[1]]$getElementText()[[1]])

# loop through results and print the links
for(i in 1:n) {
  xpath <- paste0('//*[@id="block-system-main"]/div/div/div/div/div/div[1]/form/table[3]/tbody/tr[', i + 1, ']/td[1]/a')
  link <- remDr$findElements('xpath', xpath)
  print(link[[1]]$getElementAttribute('href'))
}

# [[1]]
# [1] "https://www.askebsa.dol.gov/mewaview/View/Index/5589"
# 
# [[1]]
# [1] "https://www.askebsa.dol.gov/mewaview/View/Index/6219"

Please note that if you don't narrow down your search, you will get more than 50 results and therefore more than one page of results. In that case, you would need additional adjustments in the code: the structure of the XPath inside the for loop may change, you may need to navigate to extra pages, the loop should be limited to 50 iterations, and so on.
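
If you do end up with multiple pages, one option is to click through the pager with RSelenium. The sketch below assumes the pager exposes a link whose text contains "Next"; check the actual page before relying on it.

# rough sketch: move to the next page of results, if a "Next" link exists
next_link <- remDr$findElements(using = "partial link text", "Next")
if (length(next_link) > 0) {
  next_link[[1]]$clickElement()
  # ... then repeat the link-collection loop for the new page ...
}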

I believe this covers your actual problem, which was dynamic scraping. You may want to post your follow-up questions separately, as they involve different concepts. There are a lot of regex experts out there who would help you parse those forms, as long as you raise that specific issue in a different question with suitable tags.

Ozan answered Nov 19 '22


For the first part of the problem, this approach using rvest should work. I am receiving an error in the last step, where it cannot find the required name tag.

Here is my approach -

library(rvest)

# open an html session
web_session <- html_session("https://www.askebsa.dol.gov/epds/default.asp")
# get the search form (the second form on the page)
test_search <- html_form(read_html("https://www.askebsa.dol.gov/epds/default.asp"))[[2]]

# set the required values for fields such as company name, EIN, etc.,
# then pass that info and submit the form - here I am getting an error:
# it cannot recognize the 'search button' name
# if that is resolved it should work
set_values(test_search, 'm1company' = "Bend", 'm1ein' = '81-6268978' ) %>%
  submit_form(web_session, ., submit = "cmdSubmitM1") %>%
  read_html(.) -> some_html
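
One way to debug that is to list the field names the parsed form actually exposes and check what the submit button is called there:

# inspect the parsed form's field names (including submit buttons) to see
# whether "cmdSubmitM1" is present under that name
names(test_search$fields)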

If I get time I will try to do some more research and get back to you. I found a couple of tutorials and SO questions on similar topics here and here. They are a bit old but still useful.

For the second part it's easier since you don't have any dynamic elements involved. I was able to retrieve all the addresses in the form by using SelectorGadget and copy-pasting all the node names into the html_nodes() function.

# read the file and save it into a nested list
test_file_with_address <- read_html("https://www.askebsa.dol.gov/mewaview/View/Index/6219")

# copy-paste all the css node names and get the text from the html file
test_file_with_address %>%
  html_nodes(".border-top:nth-child(19) code , .border-top:nth-child(18) code , .border-top:nth-child(14) code , .border-top:nth-child(13) code , .border-top:nth-child(12) code , .border-top:nth-child(11) code , .border-top:nth-child(9) code , .section-header+ .border-top code") %>%
  html_text()

[1] "\r\n                Bend Chamber of Commerce Benefit Plan and Trust for Wood Products Employers\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "
 [2] "(541) 382-3221"                                                                                                                                                                                                                
 [3] "81-6268978"                                                                                                                                                                                                                    
 [4] "501"                                                                                                                                                                                                                           
 [5] "\r\n                Bend Chamber of Commerce\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "                                                   
 [6] "(541) 382-3221"                                                                                                                                                                                                                
 [7] "93-0331932"                                                                                                                                                                                                                    
 [8] "\r\n                Katy Brooks\r\n                Bend Chamber of Commerce\r\n                777 N.W. Wall Street, Suite 200\r\n                Bend,  OR  97703\r\n                \r\n                "                    
 [9] "(541) 382-3221"                                                                                                                                                                                                                
[10] "[email protected]"                                                                                                                                                                                                          
[11] "\r\n                Deb Oster\r\n                Scott Logging/Scott Transportation\r\n                400 S.W. Bluff Drive, #101\r\n                Bend,  OR  97702\r\n                \r\n                "                 
[12] "(541) 536-3636"                                                                                                                                                                                                                
[13] "[email protected]"                                                                                                                                                                                                       
[14] "\r\n                Karen Gibbons\r\n                Allen & Gibbons Logging\r\n                P.O. Box 754\r\n                Canyonville,  OR  97417\r\n                \r\n                "                               
[15] "(541) 839-4294"                                                                                                                                                                                                                
[16] "[email protected]"                                                                                                                                                                                                   
[17] "\r\n                Cascade East Benefits\r\n                dba Johnson Benefit Planning\r\n                777 N.W. Wall Street, Suite 100\r\n                Bend,  OR  97703\r\n                \r\n                "      
[18] "(541) 382-3571"                                                                                                                                                                                                                
[19] "[email protected]"                                                                                                                                                                                                
[20] "93-1130374"                                                                                                                                                                                                                    
[21] "\r\n                PacificSource Health Plans\r\n                P.O. Box 7068\r\n                Springfield,  OR  97475-0068\r\n                \r\n                "                                                       
[22] "(541) 686-1242"                                                                                                                                                                                                                
[23] "[email protected]"                                                                                                                                                                                              
[24] "93-0245545"                                                                                                                                                                                                                    
[25] "\r\n                PacificSource Health Plans\r\n                P.O. Box 7068\r\n                Springfield,  OR  97475-0068\r\n                \r\n                "                                                       
[26] "(541) 686-1242"                                                                                                                                                                                                                
[27] "[email protected]"                                                                                                                                                                                             
[28] "93-0245545"                                                                                                                                                                                                                    
[29] "N/A"

This requires some more regex magic to clean up and get the values into a data.frame, but the basic building blocks are there.
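
As a rough illustration of that cleanup, assuming the html_text() output above is stored in a character vector called raw_values (a name introduced here for the example):

library(stringr)

# strip the embedded line breaks and collapse repeated whitespace, then
# put the cleaned values into a one-column data.frame
cleaned <- str_squish(str_replace_all(raw_values, "[\r\n]+", " "))
address_df <- data.frame(value = cleaned, stringsAsFactors = FALSE)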

Suhas Hegde answered Nov 19 '22


In order to get the results you'll have to fill in the form and submit it. You can find the URL and field names by inspecting the HTML.

library(httr)
library(rvest)

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"

post_data <- list(
    m1year = 'ALL',         # Year
    m1company = '',         # Name of MEWA (starts with)
    m1ein = '',             # EIN
    m1state = 'ALL',        # State of MEWA Headquarters
    m1coverage = 'ALL',     # State(s) where MEWA offers coverage
    m1filingtype = 'ALL',   # Type of filing
    cmdSubmitM1 = 'Search',
    # hidden fields
    auth = 'Y', 
    searchtype = 'Q', 
    sf = 'EIN', 
    so = 'A'
)

Now we can submit the form and collect the links. We can scrape them with the selector table.table.table-condensed td a.

html <- read_html(POST(url, body = post_data, encode = "form"))
links <- html_nodes(html, 'table.table.table-condensed td a') %>% html_attr("href") 
links <- paste0("https://www.askebsa.dol.gov", links) 

This produces all the links of the first page.

Inspecting the HTTP traffic I noticed that the next page is loaded by submitting the same form with some extra fields (m1formid, allfilings, page). We can get the next pages by increasing the page value in a loop.

library(httr)
library(rvest)

url <- "https://www.askebsa.dol.gov/epds/m1results.asp"
post_data <- list(
    m1year='ALL', m1company='', m1ein='', m1state='all', 
    m1coverage='all', m1filingtype='ALL', cmdSubmitM1 = 'Search',
    auth='Y', searchtype='Q', sf='EIN', so='A', 
    m1formid='', allfilings='', page=1
)
links = list()

while (TRUE) {
    html <- read_html(POST(url, body = post_data, encode = "form"))
    # collect the filing links on the current page
    page_links <- html_nodes(html, 'table.table.table-condensed td a') %>% html_attr("href") %>% paste0("https://www.askebsa.dol.gov/", .)
    links <- c(links, page_links)
    # the second-to-last pager link reads 'Last' while more pages remain
    last <- html_text(tail(html_nodes(html, 'div.textnorm > a'), n=2)[1])
    if (last != 'Last') {
        break
    }
    post_data['page'] <- post_data[['page']] + 1
}

print(links)
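
Since the question asks for a CSV, the collected links can then be written out directly (the file name here is just an example):

# write the collected links to a CSV file
write.csv(data.frame(link = unlist(links)), "mewa_links.csv", row.names = FALSE)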

For the second part of the question, I assume that the goal is to select the form items and their values. You could do that by selecting all div.question-inline (and div.question) tags and the code tag that follows each item.

library(rvest)

url <- "https://www.askebsa.dol.gov/mewaview/View/Index/6219"
nodes <- html_nodes(read_html(url), 'div.question-inline, div.question')
data <- list()

for (i in nodes) {
    # the question label is the text node directly inside the div
    n <- trimws(html_text(html_node(i, xpath = './text()')))

    if (length(html_nodes(i, 'code')) == 0) {
        # no <code> child: the value sits in an <address> block next to the div
        text <- html_nodes(i, xpath = '../address/code/text()')
        v <- paste(trimws(text), collapse = '\r\n')
    } else {
        # otherwise take the text of the <code> tag(s) inside the div
        v <- html_text(html_nodes(i, 'code'))
    }
    data[[n]] <- v
}

print(data)

This code produces a named list with all the form items, but it can be modified to produce a nested list or a more appropriate structure.
At this point I must say that I have very little experience with R, so this code is probably not a good coding example. Any tips or other comments are very welcome.
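
Following up on the note above about restructuring the output, one simple option is to flatten the named list into a two-column data frame, collapsing items that have several values:

# flatten the named list from above into a question/answer data frame;
# items with several values are collapsed into a single string
answers_df <- data.frame(
    question = names(data),
    answer = vapply(data, paste, character(1), collapse = "; "),
    row.names = NULL,
    stringsAsFactors = FALSE
)
head(answers_df)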

t.m.adam answered Nov 20 '22