I'm trying to implement exception handling in RSelenium and could use some help. Please be aware that I have checked permission to crawl this page with the robotstxt package.
library(RSelenium)
library(XML)
library(janitor)
library(lubridate)
library(magrittr)
library(dplyr)
remDr <- remoteDriver(
remoteServerAddr = "192.168.99.100",
port = 4445L
)
remDr$open()
# Open TightVNC to follow along as RSelenium drives the browser
# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# look for table element
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch to table
remDr$switchToFrame(tableElem)
# parse html for first table
doc <- htmlParse(remDr$getPageSource()[[1]])
table_tmp <- readHTMLTable(doc)
table_tmp <- table_tmp[[1]][-2, -1]
table_tmp <- table_tmp[-1, ]
colnames(table_tmp) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
table_tmp$city <- rep("montreal", nrow(table_tmp))
table_tmp$date <- rep(Sys.Date() - 5, nrow(table_tmp))
# switch back to the main/outer frame
remDr$switchToFrame(NULL)
# I found the elements I want to manipulate with Inspector mode in a browser
webElems <- remDr$findElements(using = "css", ".switcherItem") # Month/Year tabs at the bottom
arrowElems <- remDr$findElements(using = "css", ".switcherArrows") # Arrows to scroll left and right at the bottom
# Create NULL object to be used in for loop
big_df <- NULL
for (i in seq(length(webElems))) {
# choose the i'th Month/Year tab
webElem <- webElems[[i]]
webElem$clickElement()
tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame
# switch to table frame
remDr$switchToFrame(tableElem)
Sys.sleep(3)
# parse html with XML package
doc <- htmlParse(remDr$getPageSource()[[1]])
Sys.sleep(3)
# Extract data from HTML table in HTML document
table_tmp <- readHTMLTable(doc)
Sys.sleep(3)
# put this into a format you can use
table <- table_tmp[[1]][-2, -1]
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
# add city name to a column
table$city <- rep("Montreal", nrow(table))
# add the Month/Year this table was extracted from
today <- Sys.Date() %m-% months(i + 1)
table$date <- today
# concatenate each table together
big_df <- dplyr::bind_rows(big_df, table)
# Switch back to main frame
remDr$switchToFrame(NULL)
################################################
### I should use exception handling here ###
################################################
}
When the browser gets to the January 2018 table it can no longer find the next webElems element and throws an error:
Selenium message:Element is not currently visible and so may not be interacted with Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03' System info: host: '617e51cbea11', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.14.79-boot2docker', java.version: '1.8.0_91' Driver info: driver.version: unknown
Error: Summary: ElementNotVisible Detail: An element command could not be completed because the element is not visible on the page. class: org.openqa.selenium.ElementNotVisibleException Further Details: run errorDetails method In addition: There were 50 or more warnings (use warnings() to see the first 50)
I've been dealing with it rather naively by including this code at the end of the for loop. This is not a good idea for two reasons: 1) the scrolling speed was finicky to figure out and would fail on other (longer) Google pages, and 2) the for loop eventually fails when it tries to click the right arrow even though it is already at the end, so it won't download the last few tables.
# click the right arrow to scroll right
arrowElem <- arrowElems[[1]]
# once you "click"" the element it is "held down" - no way to " unclick" to prevent it from scrolling too far
# I currently make sure it only scrolls a short distance - via Sys.sleep() before switching to outer frame
arrowElem$clickElement()
# give it "just enough time" to scroll right
Sys.sleep(0.3)
# switch back to outer frame to re-start the loop
remDr$switchToFrame(NULL)
What I would like to have happen is to handle this exception by executing arrowElem$clickElement() when this error pops up. I think one would typically use tryCatch(); however, this is also my first time learning about exception handling. I thought I could include it around the remDr$switchToFrame(tableElem) part of the for loop, but it doesn't work:
tryCatch({
suppressMessages({
remDr$switchToFrame(tableElem)
})
},
error = function(e) {
arrowElem <- arrowElems[[1]]
arrowElem$clickElement()
Sys.sleep(0.3)
remDr$switchToFrame(NULL)
}
)
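Something like the following is the shape of what I'm after - catch the error, click the arrow, then retry the click on the same tab. This is just a sketch of my intent (untested), and I suspect the tryCatch() may need to wrap the tab click itself rather than switchToFrame(), since the click seems to be the call that throws ElementNotVisible:
webElem <- webElems[[i]]
tryCatch(
  webElem$clickElement(),
  error = function(e) {
    # the tab is off-screen: scroll the tab strip right, wait, then retry the click
    arrowElems[[1]]$clickElement()
    Sys.sleep(0.3)
    webElem$clickElement()
  }
)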
First solution: try to write a unique XPath that matches a single element only. Second solution: use Selenium's explicit-wait feature and wait until the element is visible; once it is visible you can perform your operations.
Another solution: scroll to the element using JavaScript (or the Actions class in the Java bindings) so that the element is fully visible on screen, combined with an explicit or fluent wait as mentioned above.
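In RSelenium, those two ideas might look roughly like the sketch below: poll isElementDisplayed() and scroll the element into view with executeScript() until it reports itself as visible. The function name, timeout and polling interval are my own assumptions, not part of the original code:
wait_until_visible <- function(remDr, elem, timeout = 10, poll = 0.5) {
  waited <- 0
  while (!isTRUE(unlist(elem$isElementDisplayed())) && waited < timeout) {
    # ask the browser to scroll the element into view, then wait and re-check
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(elem))
    Sys.sleep(poll)
    waited <- waited + poll
  }
  isTRUE(unlist(elem$isElementDisplayed()))
}
# usage: only click the i'th tab once it is actually visible
# if (wait_until_visible(remDr, webElems[[i]])) webElems[[i]]$clickElement()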
I gave it a try. When handling exceptions, I like to use something of the form
check <- try(expression, silent = TRUE) # or suppressMessages(try(expression, silent = TRUE))
if (any(class(check) == "try-error")) {
# do stuff
}
I find it convenient to use and it usually works fine, including with Selenium. The issue encountered here, however, is that clicking the arrow once would always bring me to the last visible sheets, skipping everything in between.
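Slotted into your loop, that pattern would look roughly like this (a sketch only; it runs, but as said the single arrow click overshoots the hidden tabs):
check <- suppressMessages(try(webElem$clickElement(), silent = TRUE))
if (any(class(check) == "try-error")) {
  # the tab is off-screen: nudge the tab strip right and try once more
  arrowElems[[1]]$clickElement()
  Sys.sleep(0.3)
  webElem$clickElement()
}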
So here is an alternative that solves the task of *scraping the tables*, not the task of exception handling in the above sense.
The code
# Alternative: -------------------------------------------------------------
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "192.168.99.100",
port = 4445L
)
remDr$open(silent = TRUE)
# navigate to the main page
# needs to be done once before looping, else content is not available
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# I. Preliminaries:
#
# 1. build the links to all spreadsheets
# 2. define the function create_table
#
# 1.
# get page source
html <- remDr$getPageSource()[[1]]
# split it line by line
html <- unlist(strsplit(html, '\n'))
# restrict to script section
script <- grep('^\\s*var\\s+gidMatch', html, value = TRUE)
# split the script by semi-colon
script <- unlist(strsplit(script, ';'))
# retrieve information
sheet_months <- gsub('.*name:.{2}(.*?).{1},.*', '\\1',
grep('\\{name\\s*\\:', script, value = TRUE), perl = TRUE)
sheet_gid <- gsub('.*gid:.{2}(.*?).{1},.*', '\\1',
grep('gid\\s*:', script, value = TRUE), perl = TRUE)
sheet_url <- paste0('https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers%5Cx3dfalse&gid=',
sheet_gid)
#
# 2.
# table yielding function
# just for readability in the loop
create_table <- function (remDr) {
# parse html with XML package
doc <- XML::htmlParse(remDr$getPageSource()[[1]])
Sys.sleep(3)
# Extract data from HTML table in HTML document
table_tmp <- XML::readHTMLTable(doc)
Sys.sleep(3)
# put this into a format you can use
table <- table_tmp[[1]][-2, -1]
# add a check-up for size mismatch
table_fields <- as.character(t(table[1,]))
if (! any(grepl("size", tolower(table_fields)))) {
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "start_time", "end_time", "total_time", "puzzels_solved")
table$team_size <- NA_integer_
table <- table[,c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")]
} else {
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
}
# add city name to a column
table$city <- rep("Montreal", nrow(table))
# add the Month/Year this table was extracted from
today <- Sys.Date()
lubridate::month(today) <- lubridate::month(today)+1
table$date <- today
# returns the table
table
}
# II. Scraping the content
#
# 1. selenium to generate the pages
# 2. use create_table to extract the table
#
big_df <- NULL
for (k in seq_along(sheet_url)) {
# 1. navigate to the page
remDr$navigate(sheet_url[k])
# remDr$screenshot(display = TRUE) maybe one wants to see progress
table <- create_table(remDr)
# 2. concatenate each table together
big_df <- dplyr::bind_rows(big_df, table)
# inform progress
cat(paste0('\nGathered table for: \t', sheet_months[k]))
}
# close session
remDr$close()
Result
Here you can see the head and tail of big_df:
head(big_df)
# team_name team_size start_time end_time total_time puzzels_solved city date
# 1 Tortoise Tortes 5 19:00 20:05 1:05 5 Montreal 2019-02-20
# 2 Mulholland Drives Over A Smelly Cat 4 7:25 8:48 1:23 5 Montreal 2019-02-20
# 3 B.R.O.O.K. 2 7:23 9:05 1:42 5 Montreal 2019-02-20
# 4 Motivate 4 18:53 20:37 1:44 5 Montreal 2019-02-20
# 5 Fighting Mongooses 3 6:31 8:20 1:49 5 Montreal 2019-02-20
# 6 B Lovers 3 6:40 8:30 1:50 5 Montreal 2019-02-20
tail(big_df)
# team_name team_size start_time end_time total_time puzzels_solved city date
# 545 Ale Mary <NA> 6:05 7:53 1:48 5 Montreal 2019-02-20
# 546 B.R.O.O.K. <NA> 18:45 20:37 1:52 5 Montreal 2019-02-20
# 547 Ridler Co. <NA> 6:30 8:45 2:15 5 Montreal 2019-02-20
# 548 B.R.O.O.K. <NA> 18:46 21:51 3:05 5 Montreal 2019-02-20
# 549 Rotating Puzzle Collective <NA> 18:45 21:51 3:06 5 Montreal 2019-02-20
# 550 Fire Team <NA> 19:00 22:11 3:11 5 Montreal 2019-02-20
Short Explanation
To perform the task, what I've done is first generate the links to all the spreadsheets in the document. To do this, I grab the page source, keep the script line that defines the sheets (the var gidMatch line), and extract each sheet's name and gid (the sheet id digits) using regex; the gid is then used to build the direct URL of each sheet. Once this is done, loop through the URLs, gather and bind the tables.
Also, for readability purposes, I created a small function called create_table which will return the table in the proper format. It is mainly the code included in your loop. I only added a safety check on the columns (some of the spreadsheets do not have the team_size field; in those cases I set it to NA_integer_).