I am scraping a site and have been able to get basic data, but I now need to collect data from a more complicated part of the page.
I am using rvest to pull data from the AAA gas prices website:
https://gasprices.aaa.com/
I am now trying to pull county-level data, which is only displayed on the map (if you hover your cursor over an individual county). I need the gas prices for individual counties in different states. For example, if you click on Maine to go to the Maine page (https://gasprices.aaa.com/?state=ME), I need to scrape the price for Aroostook (the northernmost county on the map).
I have been able to use rvest to extract the data for the metro areas (lower on the page), using html_nodes and the node "td". However, the code for the map is more complex. Instead of a simple "td" node, the Chrome developer tools show <td class="fm-tooltip-comment">$4.928</td> on the line with the price ($4.928 is the current price in Aroostook, as of the date of this post). I cannot seem to target that element with rvest to extract it.
I have read that the class can be used, and others have proposed using a CSS selector to designate it within rvest, but I am unfamiliar with how to do so. Pulling the metro-area numbers was straightforward; however, the county-level prices embedded within the map do not seem as accessible.
Is there a way to extract this county-level data in R? Can this then be repeated for all the counties/states I need? Do I need a CSS selector, and if so, how do I find it and write it properly for rvest to use?
It looks like the information you are looking for is stored in the "index.php" file that gets downloaded when the web page loads.
The current link for Maine is "https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=89346&ver=5.9.3".
I am not sure what the r=89346 value is for: maybe a timestamp, tracking ID, or temporary token (to discourage web scraping). I suspect this URL will change, so you may need to use the developer tools in your browser to obtain the current URL.
Also, map_id refers to the state, but I don't know the rationale: Florida is 1, NC is 35, and Maine is 21.
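If the query string does change, one option is to look the current URL up programmatically. Here is a minimal sketch; it assumes the map data is still loaded through an element whose src attribute contains js_data, which is how the page was built when this was written:
library(rvest)

# read the Maine state page and pull the src of the element that loads the map data
maine_page <- read_html("https://gasprices.aaa.com/?state=ME")
data_url <- maine_page %>%
  html_element("[src*=js_data]") %>%
  html_attr("src")
data_url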
Download this file, then extract the JSON data and convert. The data starts with a {"st1": and ends with }}.
library(dplyr)

# read the index.php file and collapse it into a single character string
index_php <- readLines("https://gasprices.aaa.com/index.php?premiumhtml5map_js_data=true&map_id=21&r=19770&ver=5.9.3")
index_php <- paste(index_php, collapse = " ")

# extract the JSON part and convert it
jsondata <- stringr::str_extract(index_php, "\\{\"st1\":.+?\\}\\}")
data <- jsonlite::fromJSON(jsondata)

# create a data frame with the results
answer <- bind_rows(data)
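The first few rows (prices as of the date of this answer) look like this:
head(answer)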
     id name         shortname link  comment image color_map color_map_over
  <int> <chr>        <chr>     <chr> <chr>   <chr> <chr>     <chr>
1     1 Androscoggin ""        ""    $4.964  ""    #ca3338   #ca3338
2     2 Aroostook    ""        ""    $4.928  ""    #dd7a7a   #dd7a7a
3     3 Cumberland   ""        ""    $4.944  ""    #ca3338   #ca3338
4     4 Franklin     ""        ""    $4.936  ""    #dd7a7a   #dd7a7a
5     5 Hancock      ""        ""    $4.900  ""    #01b5da   #01b5da
6     6 Kennebec     ""        ""    $4.955  ""    #ca3338   #ca3338
There are some extra columns that need removing; I leave that as an exercise for the reader.
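For example, a minimal cleanup, assuming you only want the county name and its price (which the map stores in the comment field), could look like this:
# keep only the county name and price, with clearer column names
county_prices <- answer %>%
  select(name, comment) %>%
  rename(county = name, price = comment)
head(county_prices)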
So, you can gather the state info, including state-level prices, from the initial US page, and from there collect the URLs for each state page. Make a request to each of those pages and store the returned HTML. Then, depending on whether the county data lives in a php file, either extract the php file link, request that file, and pull out the info you want, or, where there is no php file, extract the necessary data from the state HTML you already stored.
The code below extracts the prices for all states and counties, producing a state-level data frame (df_state) and a state-with-counties data frame (df_state_county).
library(tidyverse)
library(rvest)
get_data <- function(state, url) {
  # extract county and price data from a php data file; pass in the state and the php file URI
  s <- read_html(url) %>%
    html_text() %>%
    str_match("map_data\\s+:\\s+(.*\\}),") %>%
    .[, 2]
  return(
    tibble(
      state = state,
      county = s %>% str_match_all(',"name":"(.*?)"') %>% .[[1]] %>% .[, 2],
      price = s %>% str_match_all(',"comment":"(.*?)"') %>% .[[1]] %>% .[, 2]
    )
  )
}
start_url <- "https://gasprices.aaa.com/?state=US"
page <- read_html(start_url)
# get state price info and urls for state pages
data_strings <- page %>%
  html_text() %>%
  stringr::str_match('placestxt = (".*")') %>%
  .[, 2] %>%
  str_replace_all('\\"', "") %>%
  str_split(";")
# drop empty entries, split each state string into fields, and keep abbr/state/price/url
df_state <- data_strings[[1]] %>%
  purrr::discard(~ .x == "") %>%
  str_split(",", simplify = TRUE) %>%
  .[, 1:4] %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  set_names("abbr", "state", "price", "url")
state_data <- lapply(df_state$url, read_html)
# find the php file links
df_state$data_url <- lapply(state_data, function(item) {
  item %>%
    html_element("[src*=js_data]") %>%
    html_attr("src")
})
# separate out dataframe according to whether county data is in php file or in previously stored html
no_valid_data_url <- df_state %>% filter(is.na(data_url))
has_valid_data_url <- df_state %>% filter(!is.na(data_url))
# grab the data for states where there are php files with county info
df_state_county <- map2_dfr(has_valid_data_url$state, has_valid_data_url$data_url, get_data)
# add in missing info, i.e. handle cases where data_url is NA, e.g. https://gasprices.aaa.com/?state=DC
if (nrow(no_valid_data_url) > 0) {
  html_to_use <- state_data[match(no_valid_data_url$abbr, df_state$abbr)]
  df_state_county_no_data_url <- map_dfr(html_to_use, function(html) {
    state_node <- html %>% html_element(".selected")
    state_text <- state_node %>% html_text(trim = TRUE)
    return(
      data.frame(
        state = state_text,
        county = state_text,
        price = html %>% html_element('td:contains("Current Avg.") + td') %>% html_text()
      )
    )
  })
  df_state_county <- rbind(df_state_county, df_state_county_no_data_url)
}
head(df_state, 2)
head(df_state_county, 2)
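From there, answering the original question is a simple filter. This sketch assumes the state column holds full state names as parsed from placestxt; if it holds abbreviations instead, filter on "ME":
# current average price for Aroostook county, Maine
df_state_county %>%
  filter(state == "Maine", county == "Aroostook")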