I'm trying to programmatically search a website, but the submit button's functionality seems to be powered by JavaScript. I'm not overly familiar with how this works, though, so I could be wrong.
Here is the code I'm using:
library(rvest)
BASE_URL = 'https://mdocweb.state.mi.us/otis2/otis2.aspx'
PARAMS = list(txtboxLName='Smith',
              drpdwnGender='Either',
              drpdwnRace='All',
              drpdwnStatus='All',
              submit='btnSearch')
# rvest approach
s = html_session(BASE_URL)
form = html_form(s)[[1]]
form = set_values(form, PARAMS)
resp = submit_form(s, form, submit='btnSearch') # This gives an error
# httr approach
resp = httr::POST(BASE_URL, body=PARAMS, encode='form')
html = httr::content(resp) # This just returns that same page I was on
The HTML for the button looks like this:
<input type="submit" name="btnSearch" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;btnSearch&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" language="javascript" id="btnSearch" style="width:100px;">
Given the onclick attribute, my uneducated assumption is that the use of JavaScript is what is interfering with my approach. But again, I don't fully understand how all this works, so I could be wrong.
Either way, how do I achieve my goal, if at all, using rvest or httr, but not RSelenium? Also, if this is achievable in Python, I'll accept that as well.
We first need to get the original search page, since this is an ASP.NET WebForms site (SharePoint-like) and we need some of its hidden form fields to use later on:
library(httr)
library(rvest)
library(tidyverse)
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
str(hidden)
## Named chr [1:3] "x62pLbphYWUDXsdoNdBBNrxqyHHI+K06BzjFwdP3Uooafgey2uG1gLWxzh07djRxiQR724uplZFAI8klbq6HCSkmrp8jP15EMwvkDM/biUEuQrf"| __truncated__ ...
## - attr(*, "names")= chr [1:3] "__VIEWSTATE" "__VIEWSTATEGENERATOR" "__EVENTVALIDATION"
Now, we need to act like the form and use an HTTP POST to submit it:
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = "Smith",
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
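It may be worth confirming the POST actually succeeded before parsing anything (not part of the original flow, just a small precaution):
httr::stop_for_status(res) # stop early with an informative error on HTTP failure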
We're going to need this helper function in a minute (it turns the scraped column headings into clean, unique snake_case names):
mcga <- function(x) {
  x <- tolower(x)                            # lowercase everything
  x <- gsub("[[:punct:][:space:]]+", "_", x) # punctuation & spaces -> underscores
  x <- gsub("_+", "_", x)                    # collapse repeated underscores
  x <- gsub("(^_|_$)", "", x)                # trim leading/trailing underscores
  make.unique(x, sep = "_")                  # make sure the names are unique
}
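As a quick illustration of what that helper does (made-up input, purely to show the effect):
mcga(c("Date of Birth", "MCL Number", "MCL Number"))
## [1] "date_of_birth" "mcl_number"    "mcl_number_1"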
Now, we need the HTML from the results page:
pg <- content(res, as="parsed")
Unfortunately, the "table" is really a set of <div>s. But it's programmatically generated and pretty uniform. We don't want to type much, so let's first get the column names we'll be using later on:
col_names <- html_nodes(pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
## [1] "offender_number" "last_name" "first_name"
## [4] "date_of_birth" "sex" "race"
## [7] "mcl_number" "location" "status"
## [10] "parole_board_jurisdiction_date" "maximum_date" "date_paroled"
The site is pretty nice in that it accommodates folks with disabilities by providing screen-reader hints. Unfortunately, this puts a kink in scraping since we would either have to be verbose in targeting the tags with values or clean up text later on. Thankfully, the xml2 📦 now has the ability to remove nodes:
xml_find_all(pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(pg, ".//span[@class='visible-phone']") %>% xml_remove()
We can now collect all the offender record <div> "rows":
records <- html_nodes(pg, "div.offenderRow")
And, succinctly get them into a data frame:
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols() %>%
readr::type_convert() -> xdf
xdf
## # A tibble: 25 x 12
## offender_number last_name first_name date_of_birth sex race mcl_number location status
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 544429 SMITH AARICK 12/03/1967 M White 333.74012D3 Gladwin Parole
## 2 210262 SMITH AARON 05/27/1972 M Black <NA> <NA> Dischrg
## 3 372965 SMITH AARON 09/16/1973 M White <NA> <NA> Dischrg
## 4 413411 SMITH AARON 07/13/1973 M Black <NA> <NA> Dischrg
## 5 618210 SMITH AARON 10/12/1984 M Black <NA> <NA> Dischrg
## 6 675823 SMITH AARON 05/19/1989 M Black 333.74032A5 Det Lahser Prob Prob
## 7 759548 SMITH AARON 06/19/1990 M Black <NA> <NA> Dischrg
## 8 763189 SMITH AARON 07/15/1976 M White 333.74032A5 Mt. Pleasant Prob
## 9 854557 SMITH AARON 12/27/1973 M White <NA> <NA> Dischrg
## 10 856804 SMITH AARON 02/24/1989 M White 750.110A2 Harrison CF Prison
## # ... with 15 more rows, and 3 more variables: parole_board_jurisdiction_date <chr>, maximum_date <chr>,
## # date_paroled <chr>
glimpse(xdf)
## Observations: 25
## Variables: 12
## $ offender_number <int> 544429, 210262, 372965, 413411, 618210, 675823, 759548, 763189, 854557, 85...
## $ last_name <chr> "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "SMITH", "S...
## $ first_name <chr> "AARICK", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "AARON", "...
## $ date_of_birth <chr> "12/03/1967", "05/27/1972", "09/16/1973", "07/13/1973", "10/12/1984", "05/...
## $ sex <chr> "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M",...
## $ race <chr> "White", "Black", "White", "Black", "Black", "Black", "Black", "White", "W...
## $ mcl_number <chr> "333.74012D3", NA, NA, NA, NA, "333.74032A5", NA, "333.74032A5", NA, "750....
## $ location <chr> "Gladwin", NA, NA, NA, NA, "Det Lahser Prob", NA, "Mt. Pleasant", NA, "Har...
## $ status <chr> "Parole", "Dischrg", "Dischrg", "Dischrg", "Dischrg", "Prob", "Dischrg", "...
## $ parole_board_jurisdiction_date <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, "11/28/2024", "03/25/2016", NA, NA, NA...
## $ maximum_date <chr> NA, "09/03/2015", "06/29/2016", "10/02/2017", "05/19/2017", "07/18/2019", ...
## $ date_paroled <chr> "11/15/2016", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
I had hoped that type_convert would provide better transforms, especially for the date column(s), but it didn't, and it can likely be eliminated.
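If you do want real Date columns, a minimal sketch (assuming the four mm/dd/yyyy columns shown in col_names above) is to convert them explicitly in base R:
# assumption: these are the four date columns identified earlier
date_cols <- c("date_of_birth", "parole_board_jurisdiction_date",
               "maximum_date", "date_paroled")
xdf[date_cols] <- lapply(xdf[date_cols], as.Date, format = "%m/%d/%Y")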
Now, you'll need to do some more work with the results page since the results are paginated. Thankfully, you know the page info:
xml_integer(html_nodes(pg, "span#lblPgCurrent"))
## [1] 1
xml_integer(html_nodes(pg, "span#lblTotalPgs"))
## [1] 101
You'll have to do the "hidden" dance again:
html_nodes(pg, "input[type='hidden']")
(follow the reference above for what to do with that), then rejigger a new POST call that has only those hidden fields plus one more form element: btnNext = 'Next'. You'll need to repeat this over all the individual pages in the paginated result set and finally bind_rows() everything (the UPDATE below shows a working version).
I should add that, as you figure out the pagination workflow, you should start each run with a fresh grab of the blank search page. The server seems to be configured with a pretty small viewstate/session cache timeout, and your code will break if you wait too long between iterations.
UPDATE
I wanted to make sure that last bit of advice worked, so here's the whole workflow wrapped up in functions:
library(httr)
library(rvest)
library(tidyverse)
mcga <- function(x) {
x <- tolower(x)
x <- gsub("[[:punct:][:space:]]+", "_", x)
x <- gsub("_+", "_", x)
x <- gsub("(^_|_$)", "", x)
make.unique(x, sep = "_")
}
start_search <- function(last_name) {
pre_pg <- read_html("https://mdocweb.state.mi.us/otis2/otis2.aspx")
setNames(
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(pre_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = "",
`__EVENTARGUMENT` = "",
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
txtboxLName = last_name,
txtboxFName = "",
txtboxMDOCNum = "",
drpdwnGender = "Either",
drpdwnRace = "All",
txtboxAge = "",
drpdwnStatus = "All",
txtboxMarks = "",
btnSearch = "Search"
),
encode = "form"
) -> res
content(res, as="parsed")
}
extract_results <- function(results_pg) {
col_names <- html_nodes(results_pg, "a.headings") %>% html_text(trim=TRUE) %>% mcga()
xml_find_all(results_pg, ".//div[@class='screenReaderOnly']") %>% xml_remove()
xml_find_all(results_pg, ".//span[@class='visible-phone']") %>% xml_remove()
records <- html_nodes(results_pg, "div.offenderRow")
map(sprintf(".//div[@class='span1 searchCol%s']", 1:12), ~{
html_nodes(records, xpath=.x) %>% html_text(trim=TRUE)
}) %>%
set_names(col_names) %>%
bind_cols()
}
current_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblPgCurrent"))
}
last_page_number <- function(results_pg) {
xml_integer(html_nodes(results_pg, "span#lblTotalPgs"))
}
scrape_status <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
message(sprintf("%s of %s", cur, tot))
}
next_page <- function(results_pg) {
cur <- current_page_number(results_pg)
tot <- last_page_number(results_pg)
if (cur == tot) return(NULL)
setNames(
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("value"),
html_nodes(results_pg, "input[type='hidden']") %>% html_attr("name")
) -> hidden
POST(
url = "https://mdocweb.state.mi.us/otis2/otis2.aspx",
add_headers(
Origin = "https://mdocweb.state.mi.us",
`User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.52 Safari/537.36",
Referer = "https://mdocweb.state.mi.us/otis2/otis2.aspx"
),
body = list(
`__EVENTTARGET` = hidden["__EVENTTARGET"],
`__EVENTARGUMENT` = hidden["__EVENTARGUMENT"],
`__VIEWSTATE` = hidden["__VIEWSTATE"],
`__VIEWSTATEGENERATOR` = hidden["__VIEWSTATEGENERATOR"],
`__EVENTVALIDATION` = hidden["__EVENTVALIDATION"],
btnNext = 'Next'
),
encode = "form"
) -> res
content(res, as="parsed")
}
curr_pg <- start_search("smith")
results_df <- extract_results(curr_pg)
pb <- progress_estimated(last_page_number(curr_pg)-1)
repeat {
  scrape_status(curr_pg) # optional, especially since we have a progress bar
  curr_pg <- next_page(curr_pg)
  if (is.null(curr_pg)) break
  pb$tick()$print()
  results_df <- bind_rows(results_df, extract_results(curr_pg))
  Sys.sleep(5) # be kind to the server
}
Hopefully you can follow along, but that should get all the pages for you for a given search term.
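Given the viewstate-timeout caveat above, a small (hypothetical) guard can also make expired sessions fail loudly instead of silently binding empty pages; you could call it on each page right after next_page() returns:
# hypothetical guard: an expired viewstate/session tends to come back
# with no offender rows, so stop with a clear message instead
assert_has_rows <- function(results_pg) {
  if (length(html_nodes(results_pg, "div.offenderRow")) == 0) {
    stop("No offender rows found; the search session likely expired -- start a fresh search.")
  }
  invisible(results_pg)
}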