I am trying to scrape the daily forecast from FiveThirtyEight using rvest, but my object of interest seems to be a JavaScript object, and I am having difficulty even locating what to look for and where. (I'm not well versed in CSS or JavaScript, though I have tried to educate myself over the last couple of days.)
By inspecting the webpage element and CSS selector, I have figured out the following:
The place to look is <div id="polling-avg-chart">, so I tried
library(rvest)
url <-
"https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"
url %>%
read_html() %>%
html_nodes("#polling-avg-chart")
without much success. The output is simply
{xml_nodeset (1)}
[1] <div id="polling-avg-chart"></div>\n
The individual poll results (the dots) are in <g style="clip-path: url("#line-clippoll_avg");"> ... </g>, where you can see 502 positions as numbers. I'm guessing I will have to translate the cx
and cy
of each node into the appropriate percentages, which is done by <g class="flag-box" transform="translate(30, 161.44093322753096)">...</g>
and so on.
However, I do not see the underlying data for the forecast line itself, as opposed to the dots. When I interact with the chart, elements such as <line class="hover-date-line hide-line">
change, and values such as <path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path>
change, and I'm guessing these values are what create the daily forecast line. I did read a few other SO posts such as this one, but none of them seemed applicable to this particular problem. What would be the best way to get the forecast percentages into a neat data frame?
Another way is to grab the resource directly.
In your browser, open Developer Tools (F12 in Chrome/Chromium), go to the "Network" tab, refresh (F5), and look for a resource that looks like nicely formatted JSON. Once we've found it, we copy the link address (right-click on the resource > Copy link address).
library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)
url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"
r <- GET(url)
All the data is there, weights included, so you can probably recalculate those averages yourself. The data as plotted is under "model":
dat <-
jsonlite::fromJSON(content(r, as = "text")) %>%
map(purrr::pluck, "model") %>%
bind_rows(.id = "party") %>%
mutate_all(readr::parse_guess)
# # A tibble: 5,288 x 5
# party candidate_name state forecastdate poll_avg
# <chr> <chr> <chr> <date> <dbl>
# 1 D Sanders USA 2016-07-01 36.5
# 2 D Clinton USA 2016-07-01 55.4
# 3 D Sanders USA 2016-06-30 37.0
# 4 D Clinton USA 2016-06-30 54.6
# 5 D Sanders USA 2016-06-29 37.0
# 6 D Clinton USA 2016-06-29 54.9
# 7 D Sanders USA 2016-06-28 37.2
# 8 D Clinton USA 2016-06-28 54.4
# 9 D Sanders USA 2016-06-27 37.4
# 10 D Clinton USA 2016-06-27 53.9
# # ... with 5,278 more rows
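The map(pluck, "model") + bind_rows(.id = "party") pattern works on any named list with this shape. Here is a minimal self-contained sketch on a toy list (the names and numbers are invented for illustration, not taken from the actual endpoint):

```r
library(purrr)
library(dplyr)

# Toy list shaped like the parsed JSON: one element per party,
# each holding a "model" data frame (values invented for illustration)
toy <- list(
  D = list(model = data.frame(candidate_name = c("Sanders", "Clinton"),
                              poll_avg = c(36.5, 55.4))),
  R = list(model = data.frame(candidate_name = c("Trump", "Kasich"),
                              poll_avg = c(40.1, 20.3)))
)

dat <- toy %>%
  map(pluck, "model") %>%      # pull the "model" table out of each element
  bind_rows(.id = "party")     # stack them; list names become a "party" column

dat
```

The .id = "party" argument is what turns the list names ("D", "R") into a proper column, so no manual bookkeeping is needed.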
Reproduce graphs:
dat %>%
filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
ggplot(aes(forecastdate, poll_avg)) +
geom_line(aes(col = candidate_name)) +
facet_wrap(~party)
If you'd like interactivity:
library(dygraphs)
library(htmltools)
library(xts)
foo <- dat %>%
filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
split(.$party) %>%
map(~ {
select(.x, forecastdate, candidate_name, poll_avg) %>%
spread(candidate_name, poll_avg) %>%
{xts(.[-1], .[[1]])} %>%
dygraph(group = "poll-model") %>%
dyRangeSelector()
})
browsable(tagList(foo))
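The {xts(.[-1], .[[1]])} step in the pipeline just converts the wide data frame into an xts time series, using the first column as the time index and the rest as values. A small sketch on a toy data frame (invented values):

```r
library(xts)

# Toy wide data frame: date column first, one column per candidate
df <- data.frame(forecastdate = as.Date(c("2016-06-30", "2016-07-01")),
                 Clinton = c(54.6, 55.4),
                 Sanders = c(37.0, 36.5))

# Drop the date column for the values, and use it as the time index
x <- xts(df[-1], order.by = df[[1]])
x
```

dygraph() accepts such an xts object directly, which is why the spread-then-xts dance is needed before plotting.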