I am attempting to use this webpage http://volcano.si.edu/search_eruption.cfm to scrape data. There are two drop-down boxes that ask for filters of the data. I do not need filtered data, so I leave those blank and continue on to the next page by clicking "Search Eruptions".
What I have noticed, though, is that the resulting table only includes a small amount of columns (only 5) compared to the total amount of columns (total of 24) it should have. However, all 24 columns will be there if you click the "Download Results to Excel" button and open the downloaded file. This is what I need.
So, it looks like this has turned from a scraping exercise (using httr and rvest) into something more difficult. However, I'm stumped on how to actually "click" on the "Download Results to Excel" button using R. My guess is I will have to use RSelenium, but here is my code trying to use httr with POST in case there is an easier way that any of you kind people can find. I've also tried using gdata, data.table, XML, etc. to no avail which could just be a result of user error.
Also, it might be helpful to know that the download button cannot be right-clicked to show a URL.
url <- "http://volcano.si.edu/database/search_eruption_results.cfm"
searchcriteria <- list(
eruption_category = "",
country = ""
)
mydata <- POST(url, body = "searchcriteria")
Using the Inspector in my browser, I was able to see that the two filters are "eruption_category" and "country" and both will be blank since I do not need any filtered data.
Lastly, it would seem that the above code will get me on to the page that has the table with only 5 columns. However, I was still unable to scrape this table using rvest in the code below (using SelectorGadget to scrape just one column). In the end, this part doesn't matter as much because, as I had said above, I need all 24 columns, not just these 5. But, if you find any errors with what I did below as well, I would be grateful.
Eruptions <- mydata %>%
read_html() %>%
html_nodes(".td8") %>%
html_text()
Eruptions
Thank you for any help you can provide.
Just mimic the POST
it does:
library(httr)
library(rvest)
library(purrr)
library(dplyr)
POST("http://volcano.si.edu/search_eruption_results.cfm",
body = list(bp = "", `eruption_category[]` = "", `country[]` = "", polygon = "", cp = "1"),
encode = "form") -> res
content(res, as="parsed") %>%
html_nodes("div.DivTableSearch") %>%
html_nodes("div.tr") %>%
map(html_children) %>%
map(html_text) %>%
map(as.list) %>%
map_df(setNames, c("volcano_name", "subregion", "eruption_type",
"start_date", "max_vei", "X1")) %>%
select(-X1)
## # A tibble: 750 × 5
## volcano_name subregion eruption_type start_date
## <chr> <chr> <chr> <chr>
## 1 Chirinkotan Kuril Islands Confirmed Eruption 2016 Nov 29
## 2 Zhupanovsky Kamchatka Peninsula Confirmed Eruption 2016 Nov 20
## 3 Kerinci Sumatra Confirmed Eruption 2016 Nov 15
## 4 Langila New Britain Confirmed Eruption 2016 Nov 3
## 5 Cleveland Aleutian Islands Confirmed Eruption 2016 Oct 24
## 6 Ebeko Kuril Islands Confirmed Eruption 2016 Oct 20
## 7 Ulawun New Britain Confirmed Eruption 2016 Oct 11
## 8 Karymsky Kamchatka Peninsula Confirmed Eruption 2016 Oct 5
## 9 Ubinas Peru Confirmed Eruption 2016 Oct 2
## 10 Rinjani Lesser Sunda Islands Confirmed Eruption 2016 Sep 27
## # ... with 740 more rows, and 1 more variables: max_vei <chr>
I assumed the "Excel" part could be inferred, but if not:
POST("http://volcano.si.edu/search_eruption_excel.cfm",
body = list(`eruption_category[]` = "",
`country[]` = ""),
encode = "form",
write_disk("eruptions.xls")) -> res
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With