I'm trying to implement exception handling in RSelenium and could use some help. Please be aware that I have checked permission to crawl this page with the robotstxt package.
library(RSelenium)
library(XML)
library(janitor)
library(lubridate)
library(magrittr)
library(dplyr)
remDr <- remoteDriver(
remoteServerAddr = "192.168.99.100",
port = 4445L
)
remDr$open()
# Open TightVNC to follow along as RSelenium drives the browser
# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# look for table element
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")
# switch to table
remDr$switchToFrame(tableElem)
# parse html for first table
doc <- htmlParse(remDr$getPageSource()[[1]])
table_tmp <- readHTMLTable(doc)
table_tmp <- table_tmp[[1]][-2, -1]
table_tmp <- table_tmp[-1, ]
colnames(table_tmp) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
table_tmp$city <- rep("montreal", nrow(table_tmp))
table_tmp$date <- rep(Sys.Date() - 5, nrow(table_tmp))
# switch back to the main/outer frame
remDr$switchToFrame(NULL)
# I found the elements I want to manipulate with Inspector mode in a browser
webElems <- remDr$findElements(using = "css", ".switcherItem") # Month/Year tabs at the bottom
arrowElems <- remDr$findElements(using = "css", ".switcherArrows") # Arrows to scroll left and right at the bottom
# Create NULL object to be used in for loop
big_df <- NULL
for (i in seq(length(webElems))) {
# choose the i'th Month/Year tab
webElem <- webElems[[i]]
webElem$clickElement()
tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame
# switch to table frame
remDr$switchToFrame(tableElem)
Sys.sleep(3)
# parse html with XML package
doc <- htmlParse(remDr$getPageSource()[[1]])
Sys.sleep(3)
# Extract data from HTML table in HTML document
table_tmp <- readHTMLTable(doc)
Sys.sleep(3)
# put this into a format you can use
table <- table_tmp[[1]][-2, -1]
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
# add city name to a column
table$city <- rep("Montreal", nrow(table))
# add the Month/Year this table was extracted from
today <- Sys.Date() %m-% months(i + 1)
table$date <- today
# concatenate each table together
big_df <- dplyr::bind_rows(big_df, table)
# Switch back to main frame
remDr$switchToFrame(NULL)
################################################
### I should use exception handling here ###
################################################
}
When the browser gets to the January 2018 table it can no longer find the next webElems element and throws an error:
Selenium message:Element is not currently visible and so may not be interacted with Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03' System info: host: '617e51cbea11', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.14.79-boot2docker', java.version: '1.8.0_91' Driver info: driver.version: unknown
Error: Summary: ElementNotVisible Detail: An element command could not be completed because the element is not visible on the page. class: org.openqa.selenium.ElementNotVisibleException Further Details: run errorDetails method In addition: There were 50 or more warnings (use warnings() to see the first 50)
I've been dealing with it rather naively by including this code at the end of the for loop. This is not a good idea for two reasons: 1) the scrolling speed was finicky to figure out and would fail on other (longer) Google pages, and 2) the for loop eventually fails when it tries to click the right arrow even though it is already at the end, so it won't download the last few tables.
# click the right arrow to scroll right
arrowElem <- arrowElems[[1]]
# once you "click"" the element it is "held down" - no way to " unclick" to prevent it from scrolling too far
# I currently make sure it only scrolls a short distance - via Sys.sleep() before switching to outer frame
arrowElem$clickElement()
# give it "just enough time" to scroll right
Sys.sleep(0.3)
# switch back to outer frame to re-start the loop
remDr$switchToFrame(NULL)
What I would like to have happen is to handle this exception by executing arrowElem$clickElement() when this error pops up. I think one would typically use tryCatch(); however, this is also my first time learning about exception handling. I thought I could include it around the remDr$switchToFrame(tableElem) part of the for loop, but it doesn't work:
tryCatch({
suppressMessages({
remDr$switchToFrame(tableElem)
})
},
error = function(e) {
arrowElem <- arrowElems[[1]]
arrowElem$clickElement()
Sys.sleep(0.3)
remDr$switchToFrame(NULL)
}
)
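Something like the following is the shape of what I'm after - catch the error, click the arrow, then retry the click on the same tab. This is just a sketch of my intent (untested), and I suspect the tryCatch() may need to wrap the tab click itself rather than switchToFrame(), since the click seems to be the call that throws ElementNotVisible:
webElem <- webElems[[i]]
tryCatch(
  webElem$clickElement(),
  error = function(e) {
    # the tab is off-screen: scroll the tab strip right, wait, then retry the click
    arrowElems[[1]]$clickElement()
    Sys.sleep(0.3)
    webElem$clickElement()
  }
)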
First solution: try to write a unique XPath that matches a single element only. Second solution: use Selenium's explicit-wait feature and wait until the element is visible; once it is visible you can perform your operations.
Another solution: scroll to the element using JavaScript (or the Actions class in the Java bindings) so that the element is fully visible on screen, combined with an explicit or fluent wait as mentioned above.
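In RSelenium, those two ideas might look roughly like the sketch below: poll isElementDisplayed() and scroll the element into view with executeScript() until it reports itself as visible. The function name, timeout and polling interval are my own assumptions, not part of the original code:
wait_until_visible <- function(remDr, elem, timeout = 10, poll = 0.5) {
  waited <- 0
  while (!isTRUE(unlist(elem$isElementDisplayed())) && waited < timeout) {
    # ask the browser to scroll the element into view, then wait and re-check
    remDr$executeScript("arguments[0].scrollIntoView(true);", list(elem))
    Sys.sleep(poll)
    waited <- waited + poll
  }
  isTRUE(unlist(elem$isElementDisplayed()))
}
# usage: only click the i'th tab once it is actually visible
# if (wait_until_visible(remDr, webElems[[i]])) webElems[[i]]$clickElement()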
I gave it a try. When handling exceptions, I like to use something of the form
check <- try(expression, silent = TRUE) # or suppressMessages(try(expression, silent = TRUE))
if (any(class(check) == "try-error")) {
# do stuff
}
I find it convenient to use and it usually works fine, including with Selenium. The issue encountered here, however, is that clicking the arrow once would always bring me to the last visible sheets, skipping everything in between.
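Slotted into your loop, that pattern would look roughly like this (a sketch only; it runs, but as said the single arrow click overshoots the hidden tabs):
check <- suppressMessages(try(webElem$clickElement(), silent = TRUE))
if (any(class(check) == "try-error")) {
  # the tab is off-screen: nudge the tab strip right and try once more
  arrowElems[[1]]$clickElement()
  Sys.sleep(0.3)
  webElem$clickElement()
}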
So here is an alternative that solves the task of *scraping the tables*, not the task of exception handling in the above sense.
The code
# Alternative: -------------------------------------------------------------
remDr <- RSelenium::remoteDriver(
remoteServerAddr = "192.168.99.100",
port = 4445L
)
remDr$open(silent = TRUE)
# navigate to the main page
# needs to be done once before looping, else content is not available
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")
# I. Preliminaries:
#
# 1. build the links to all spreadsheets
# 2. define the function create_table
#
# 1.
# get page source
html <- remDr$getPageSource()[[1]]
# split it line by line
html <- unlist(strsplit(html, '\n'))
# restrict to script section
script <- grep('^\\s*var\\s+gidMatch', html, value = TRUE)
# split the script by semi-colon
script <- unlist(strsplit(script, ';'))
# retrieve information
sheet_months <- gsub('.*name:.{2}(.*?).{1},.*', '\\1',
grep('\\{name\\s*\\:', script, value = TRUE), perl = TRUE)
sheet_gid <- gsub('.*gid:.{2}(.*?).{1},.*', '\\1',
grep('gid\\s*:', script, value = TRUE), perl = TRUE)
sheet_url <- paste0('https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers%5Cx3dfalse&gid=',
sheet_gid)
#
# 2.
# table yielding function
# just for readability in the loop
create_table <- function (remDr) {
# parse html with XML package
doc <- XML::htmlParse(remDr$getPageSource()[[1]])
Sys.sleep(3)
# Extract data from HTML table in HTML document
table_tmp <- XML::readHTMLTable(doc)
Sys.sleep(3)
# put this into a format you can use
table <- table_tmp[[1]][-2, -1]
# add a check-up for size mismatch
table_fields <- as.character(t(table[1,]))
if (! any(grepl("size", tolower(table_fields)))) {
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "start_time", "end_time", "total_time", "puzzels_solved")
table$team_size <- NA_integer_
table <- table[,c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")]
} else {
table <- table[-1, ]
# rename the columns
colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
}
# add city name to a column
table$city <- rep("Montreal", nrow(table))
# add the Month/Year this table was extracted from
today <- Sys.Date()
lubridate::month(today) <- lubridate::month(today)+1
table$date <- today
# returns the table
table
}
# II. Scraping the content
#
# 1. selenium to generate the pages
# 2. use create_table to extract the table
#
big_df <- NULL
for (k in seq_along(sheet_url)) {
# 1. navigate to the page
remDr$navigate(sheet_url[k])
# remDr$screenshot(display = TRUE) maybe one wants to see progress
table <- create_table(remDr)
# 2. concatenate each table together
big_df <- dplyr::bind_rows(big_df, table)
# inform progress
cat(paste0('\nGathered table for: \t', sheet_months[k]))
}
# close session
remDr$close()
Result
Here you can see the head and tail of big_df:
head(big_df)
# team_name team_size start_time end_time total_time puzzels_solved city date
# 1 Tortoise Tortes 5 19:00 20:05 1:05 5 Montreal 2019-02-20
# 2 Mulholland Drives Over A Smelly Cat 4 7:25 8:48 1:23 5 Montreal 2019-02-20
# 3 B.R.O.O.K. 2 7:23 9:05 1:42 5 Montreal 2019-02-20
# 4 Motivate 4 18:53 20:37 1:44 5 Montreal 2019-02-20
# 5 Fighting Mongooses 3 6:31 8:20 1:49 5 Montreal 2019-02-20
# 6 B Lovers 3 6:40 8:30 1:50 5 Montreal 2019-02-20
tail(big_df)
# team_name team_size start_time end_time total_time puzzels_solved city date
# 545 Ale Mary <NA> 6:05 7:53 1:48 5 Montreal 2019-02-20
# 546 B.R.O.O.K. <NA> 18:45 20:37 1:52 5 Montreal 2019-02-20
# 547 Ridler Co. <NA> 6:30 8:45 2:15 5 Montreal 2019-02-20
# 548 B.R.O.O.K. <NA> 18:46 21:51 3:05 5 Montreal 2019-02-20
# 549 Rotating Puzzle Collective <NA> 18:45 21:51 3:06 5 Montreal 2019-02-20
# 550 Fire Team <NA> 19:00 22:11 3:11 5 Montreal 2019-02-20
Short Explanation
To perform the task, what I've done is first generate the links to all the spreadsheets in the document. To do this, I grab the page source, keep the script line that defines the sheets (the var gidMatch line), and extract each sheet's name and gid (the sheet id digits) using regex; the gid is then used to build the direct URL of each sheet. Once this is done, loop through the URLs, gather and bind the tables.
Also, for readability purposes, I created a small function called create_table which will return the table in the proper format. It is mainly the code included in your loop. I only added a safety check on the columns (some of the spreadsheets do not have the team_size field; in those cases I set it to NA_integer_).