
rvest web scraping with javascript

I am trying to scrape the daily forecast from FiveThirtyEight using rvest, but the chart I am interested in appears to be rendered by JavaScript, and I am having difficulty working out where and what to look for. (I'm not well versed in CSS or JavaScript, though I have tried to educate myself over the last couple of days.)

By inspecting the webpage element and CSS selector, I have figured out the following:

  • The location to look is <div id="polling-avg-chart">, so I tried

    library(rvest)
    url <- 
      "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"
    
    url %>% 
      read_html() %>% 
      html_nodes("#polling-avg-chart")
    

    without much success. The output is simply

    {xml_nodeset (1)}

    [1] <div id="polling-avg-chart"></div>\n

  • The individual poll results (the dots) are in <g style="clip-path: url("#line-clippoll_avg");"> ... </g>, which contains 502 nodes with numeric positions. I'm guessing that I will have to translate the cx and cy of each node into the appropriate percentages, which seems to be done by <g class="flag-box" transform="translate(30, 161.44093322753096)">...</g> and so on.

  • However, I can only find the dots; I do not see the underlying data for the forecast line.

  • When I let my cursor hover over the chart, I see things such as <line class="hover-date-line hide-line"> change, and values such as <path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path> change, and I'm guessing that these values are what's creating the daily forecast line.
  • But where these values are stored, and how to translate them back into things like "49.1% Clinton vs. 26.6% Sanders", is still a mystery to me.

I did read a few other SO posts such as this but none of them seemed applicable to this particular problem. What would be the best way to get the forecast percentages in a neat dataframe?

Kim asked May 17 '18 00:05


1 Answer

Another way is to grab the resource directly.
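The empty <div> in the question is expected: read_html() only parses the HTML the server sends and never runs the page's JavaScript, so anything drawn in the browser at run time is absent. A minimal offline sketch of that behaviour, where the HTML string is a hypothetical stand-in for the real page source:

```r
library(rvest)

# Hypothetical stand-in for the page source: the chart container is empty
# until a script (which read_html never executes) fills it in the browser.
static_html <- '<html><body>
  <div id="polling-avg-chart"></div>
  <script>/* the chart would be drawn here at run time */</script>
</body></html>'

chart <- read_html(static_html) %>%
  html_nodes("#polling-avg-chart")

length(chart)                 # number of nodes matching the selector
length(html_children(chart))  # number of child nodes inside the container
```

The selector matches the container, but the container has no children to scrape, which is why going to the data source is the more practical route.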

In your browser, open Developer Tools (F12 in Chrome/Chromium), head to "Network", refresh (F5), and look for a response that looks like nicely formatted JSON. Once you have found it, copy its link address (right-click on the resource > Copy link address).


library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"

r <- GET(url)
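
Before parsing, it can help to confirm that the request actually succeeded and returned JSON. The following is only a sketch: fetch_json_text is a hypothetical helper name, not part of httr, and the content-type check assumes the server labels the response as JSON.

```r
library(httr)

# Hypothetical helper (not part of the original answer): fetch a URL and
# return the body as text, failing loudly on HTTP errors.
fetch_json_text <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # turns 4xx/5xx responses into R errors
  # Assumption: a JSON endpoint reports a content type containing "json".
  if (!grepl("json", http_type(resp), fixed = TRUE)) {
    warning("Unexpected content type: ", http_type(resp))
  }
  content(resp, as = "text", encoding = "UTF-8")
}
```

The returned string could then be passed to jsonlite::fromJSON() in place of content(r, as = "text").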

All of the data is there, including the weights, so you can probably recalculate the averages yourself. The data as plotted is in "model":

dat <- 
  jsonlite::fromJSON(content(r, as = "text")) %>% 
  map(purrr::pluck, "model") %>% 
  bind_rows(.id = "party") %>% 
  mutate_all(readr::parse_guess)

# # A tibble: 5,288 x 5
#    party candidate_name state forecastdate poll_avg
#    <chr> <chr>          <chr> <date>          <dbl>
#  1 D     Sanders        USA   2016-07-01       36.5
#  2 D     Clinton        USA   2016-07-01       55.4
#  3 D     Sanders        USA   2016-06-30       37.0
#  4 D     Clinton        USA   2016-06-30       54.6
#  5 D     Sanders        USA   2016-06-29       37.0
#  6 D     Clinton        USA   2016-06-29       54.9
#  7 D     Sanders        USA   2016-06-28       37.2
#  8 D     Clinton        USA   2016-06-28       54.4
#  9 D     Sanders        USA   2016-06-27       37.4
# 10 D     Clinton        USA   2016-06-27       53.9
# # ... with 5,278 more rows
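
To see what the pluck/bind_rows step does without hitting the network, here is a small offline sketch. The JSON layout (one top-level key per party, each with a "model" array of records) is an assumption inferred from the output above; the "D" values are copied from that output and the "R" row is a made-up placeholder.

```r
library(jsonlite)
library(purrr)
library(dplyr)

# Toy JSON mimicking the assumed layout of USA.json.
# The "R" row is a made-up placeholder, not real polling data.
toy_json <- '{
  "D": {"model": [
    {"candidate_name": "Sanders", "state": "USA", "forecastdate": "2016-07-01", "poll_avg": 36.5},
    {"candidate_name": "Clinton", "state": "USA", "forecastdate": "2016-07-01", "poll_avg": 55.4}
  ]},
  "R": {"model": [
    {"candidate_name": "Trump", "state": "USA", "forecastdate": "2016-07-01", "poll_avg": 40.1}
  ]}
}'

toy <- fromJSON(toy_json) %>%
  map(pluck, "model") %>%          # keep only the "model" data frame per party
  bind_rows(.id = "party") %>%     # stack them, recording the list name as party
  mutate(forecastdate = as.Date(forecastdate))

toy
```

Here the date column is converted explicitly with as.Date() rather than guessed, which is a design choice for the sketch and not what the answer's code does.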

To reproduce the graphs:

dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)


If you'd like interactivity:

library(dygraphs)
library(htmltools)
library(xts)  # provides xts(), used below to build the time series

foo <- dat %>% 
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>% 
  split(.$party) %>% 
  map(~ {
    select(.x, forecastdate, candidate_name, poll_avg) %>% 
      spread(candidate_name, poll_avg) %>% 
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>% 
      dyRangeSelector()
  })

browsable(tagList(foo))
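
The reshaping inside map() is the part most likely to trip people up, so here is an offline sketch of just that step on four rows taken from the output above. It uses tidyr::pivot_wider(), the newer equivalent of spread(); the explicit as.matrix() call is a cautious assumption rather than something the answer requires.

```r
library(dplyr)
library(tidyr)
library(xts)

# Four rows copied from the printed output above (party D only).
toy <- tibble(
  forecastdate   = as.Date(c("2016-07-01", "2016-07-01", "2016-06-30", "2016-06-30")),
  candidate_name = c("Sanders", "Clinton", "Sanders", "Clinton"),
  poll_avg       = c(36.5, 55.4, 37.0, 54.6)
)

# One row per date, one column per candidate.
wide <- toy %>%
  pivot_wider(names_from = candidate_name, values_from = poll_avg)

# Build a time series indexed by date; xts orders the rows chronologically.
series <- xts(as.matrix(wide[-1]), order.by = wide[[1]])
series
```

An object like series is what dygraph() receives for each party in the code above.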


Aurèle answered Oct 05 '22 02:10