
Exception handling RSelenium switchToFrame() Error: ElementNotVisible

I'm trying to implement exception handling in RSelenium and need some help, please. Please be aware that I have checked permission to crawl this page with the robotstxt package.
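
(A minimal version of that check might look like the following; paths_allowed() returns TRUE when the path may be crawled.)

library(robotstxt)
# TRUE if docs.google.com/robots.txt permits crawling this path
paths_allowed("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub")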

library(RSelenium)
library(XML)
library(janitor)
library(lubridate)
library(magrittr)
library(dplyr)

remDr <- remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open()

# Open TightVNC to follow along as RSelenium drives the browser

# navigate to the main page
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")

# look for table element
tableElem <- remDr$findElement(using = "id", "pageswitcher-content")

# switch to table
remDr$switchToFrame(tableElem)

# parse html for first table
doc <- htmlParse(remDr$getPageSource()[[1]])
table_tmp <- readHTMLTable(doc)
table_tmp <- table_tmp[[1]][-2, -1]
table_tmp <- table_tmp[-1, ]
colnames(table_tmp) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
table_tmp$city <- rep("montreal", nrow(table_tmp))
table_tmp$date <- rep(Sys.Date() - 5, nrow(table_tmp))

# switch back to the main/outer frame
remDr$switchToFrame(NULL)

# I found the elements I want to manipulate with Inspector mode in a browser
webElems <- remDr$findElements(using = "css", ".switcherItem") # Month/Year tabs at the bottom
arrowElems <- remDr$findElements(using = "css", ".switcherArrows") # Arrows to scroll left and right at the bottom

# Create NULL object to be used in for loop
big_df <- NULL
for (i in seq_along(webElems)) {

  # choose the i'th Month/Year tab
  webElem <- webElems[[i]]
  webElem$clickElement()

  tableElem <- remDr$findElement(using = "id", "pageswitcher-content") # The inner table frame

  # switch to table frame
  remDr$switchToFrame(tableElem)
  Sys.sleep(3)
  # parse html with XML package
  doc <- htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)
  # Extract data from HTML table in HTML document
  table_tmp <- readHTMLTable(doc)
  Sys.sleep(3)
  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]
  table <- table[-1, ]
  # rename the columns
  colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
  # add city name to a column
  table$city <- rep("Montreal", nrow(table))

  # add the Month/Year this table was extracted from
  today <- Sys.Date() %m-% months(i + 1)
  table$date <- today

  # concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)

  # Switch back to main frame
  remDr$switchToFrame(NULL)

  ################################################
  ###   I should use exception handling here   ###
  ################################################


}

When the browser gets to the January 2018 table, it can no longer find the next webElems element and throws an error:


Selenium message: Element is not currently visible and so may not be interacted with
Build info: version: '2.53.1', revision: 'a36b8b1', time: '2016-06-30 17:37:03'
System info: host: '617e51cbea11', ip: '172.17.0.2', os.name: 'Linux', os.arch: 'amd64', os.version: '4.14.79-boot2docker', java.version: '1.8.0_91'
Driver info: driver.version: unknown

Error: Summary: ElementNotVisible
Detail: An element command could not be completed because the element is not visible on the page.
class: org.openqa.selenium.ElementNotVisibleException
Further Details: run errorDetails method
In addition: There were 50 or more warnings (use warnings() to see the first 50)

I've been dealing with it rather naively by including the code below at the end of the for loop. This is not a good idea for two reasons: 1) the scrolling speed was finicky to figure out and would fail on other (longer) Google pages, and 2) the for loop eventually fails when it tries to click the right arrow even though it is already at the end, so it never downloads the last few tables.

# click the right arrow to scroll right
arrowElem <- arrowElems[[1]]
# once you "click"" the element it is "held down" - no way to " unclick" to prevent it from scrolling too far
# I currently make sure it only scrolls a short distance - via Sys.sleep() before switching to outer frame
arrowElem$clickElement()
# give it "just enough time" to scroll right
Sys.sleep(0.3)
# switch back to outer frame to re-start the loop
remDr$switchToFrame(NULL)

What I would like to happen is to handle this exception by executing arrowElem$clickElement() when the error pops up. I think one would typically use tryCatch(); however, this is also my first time learning about exception handling. I thought I could wrap the remDr$switchToFrame(tableElem) part of the for loop like this, but it doesn't work:

tryCatch({
        suppressMessages({
            remDr$switchToFrame(tableElem)
        })
    },
    error = function(e) {
        arrowElem <- arrowElems[[1]]
        arrowElem$clickElement()
        Sys.sleep(0.3)
        remDr$switchToFrame(NULL)
    }
)
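
Presumably the handler above never fires because the ElementNotVisible error is raised by webElem$clickElement(), not by switchToFrame(). A sketch that guards the click itself (untested) would be:

tryCatch({
    webElem$clickElement()
  },
  error = function(e) {
    # assumed recovery: scroll the tab bar right, then retry the click
    arrowElems[[1]]$clickElement()
    Sys.sleep(0.3)
    webElem$clickElement()
  }
)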
Asked Jan 08 '19 by Matthew J. Oldach



1 Answer

I gave it a try. When handling exceptions, I like to use something of the form

check <- try(expression, silent = TRUE) # or suppressMessages(try(expression, silent = TRUE))
if (any(class(check) == "try-error")) {
  # do stuff
}

I find it convenient to use and it usually works fine, including when using Selenium. The issue encountered here, however, is that clicking the arrow once would always bring me to the last visible sheets, skipping everything in between.
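
Applied to the click that actually throws here, the pattern would look roughly like this (a sketch only - as noted, the arrow-click recovery itself proved unreliable on this page):

check <- suppressMessages(try(webElem$clickElement(), silent = TRUE))
if (any(class(check) == "try-error")) {
  # attr(check, "condition") carries the original error object,
  # so you can log it or branch on its message
  message("click failed: ", conditionMessage(attr(check, "condition")))
  arrowElems[[1]]$clickElement() # scroll the tab bar right
  Sys.sleep(0.3)
}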


Alternative Solution

So here is an alternative that solves the task of *scraping the tables*, not the task of exception handling in the above sense.

The code

# Alternative: -------------------------------------------------------------

remDr <- RSelenium::remoteDriver(
  remoteServerAddr = "192.168.99.100",
  port = 4445L
)
remDr$open(silent = TRUE)
# navigate to the main page
# needs to be done once before looping, else the content is not available
remDr$navigate("https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pub?output=html&widget=true#gid=690408156")


# I. Preliminaries:
# 
# 1. build the links to all spreadsheets
# 2. define the function create_table
# 
# 1.
# get page source
html <- remDr$getPageSource()[[1]]
# split it line by line
html <- unlist(strsplit(html, '\n'))
# restrict to script section
script <- grep('^\\s*var\\s+gidMatch', html, value = TRUE)
# split the script by semi-colon
script <- unlist(strsplit(script, ';'))
# retrieve information
sheet_months <- gsub('.*name:.{2}(.*?).{1},.*', '\\1', 
                     grep('\\{name\\s*\\:', script, value = TRUE), perl = TRUE)
sheet_gid <- gsub('.*gid:.{2}(.*?).{1},.*', '\\1', 
                  grep('gid\\s*:', script, value = TRUE), perl = TRUE)
sheet_url <- paste0('https://docs.google.com/spreadsheets/d/1o1PlLIQS8v-XSuEz1eqZB80kcJk9xg5lsbueB7mTg1U/pubhtml/sheet?headers%5Cx3dfalse&gid=',
                    sheet_gid)
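# (for reference: the 'var gidMatch ...' script line grabbed above is assumed
#  to hold entries roughly of the form {name: "January 2019", gid: "690408156"},
#  which is why the .{2}/.{1} offsets skip the quotes around each value)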
#
# 2. 
# table yielding function
# just for readability in the loop
create_table <- function (remDr) {
  # parse html with XML package
  doc <- XML::htmlParse(remDr$getPageSource()[[1]])
  Sys.sleep(3)
  # Extract data from HTML table in HTML document
  table_tmp <- XML::readHTMLTable(doc)
  Sys.sleep(3)
  # put this into a format you can use
  table <- table_tmp[[1]][-2, -1]
  # check for a column-count mismatch (some sheets lack the team_size field)
  table_fields <- as.character(t(table[1,]))
  if (! any(grepl("size", tolower(table_fields)))) {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "start_time", "end_time", "total_time", "puzzels_solved")
    table$team_size <- NA_integer_
    table <- table[,c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")]
  } else {
    table <- table[-1, ]
    # rename the columns
    colnames(table) <- c("team_name", "team_size", "start_time", "end_time", "total_time", "puzzels_solved")
  }
  # add city name to a column
  table$city <- rep("Montreal", nrow(table))
  
  # add the Month/Year this table was extracted from
  today <- Sys.Date()
  lubridate::month(today) <- lubridate::month(today)+1
  table$date <- today
  
  # returns the table
  table
}

# II. Scraping the content
# 
# 1. selenium to generate the pages
# 2. use create_table to extract the table
# 
big_df <- NULL
for (k in seq_along(sheet_url)) {
  # 1. navigate to the page
  remDr$navigate(sheet_url[k])
  # remDr$screenshot(display = TRUE) maybe one wants to see progress
  table <- create_table(remDr)
  
  # 2. concatenate each table together
  big_df <- dplyr::bind_rows(big_df, table)
  
  # inform progress 
  cat(paste0('\nGathered table for: \t', sheet_months[k]))
}

# close session
remDr$close()

Result

Here you can see the head and tail of big_df

head(big_df)
#                             team_name team_size start_time end_time total_time puzzels_solved     city       date
# 1                     Tortoise Tortes         5      19:00    20:05       1:05              5 Montreal 2019-02-20
# 2 Mulholland Drives Over A Smelly Cat         4       7:25     8:48       1:23              5 Montreal 2019-02-20
# 3                          B.R.O.O.K.         2       7:23     9:05       1:42              5 Montreal 2019-02-20
# 4                            Motivate         4      18:53    20:37       1:44              5 Montreal 2019-02-20
# 5                  Fighting Mongooses         3       6:31     8:20       1:49              5 Montreal 2019-02-20
# 6                            B Lovers         3       6:40     8:30       1:50              5 Montreal 2019-02-20
tail(big_df)
#                             team_name team_size start_time end_time total_time puzzels_solved     city       date
# 545                          Ale Mary      <NA>       6:05     7:53       1:48              5 Montreal 2019-02-20
# 546                        B.R.O.O.K.      <NA>      18:45    20:37       1:52              5 Montreal 2019-02-20
# 547                        Ridler Co.      <NA>       6:30     8:45       2:15              5 Montreal 2019-02-20
# 548                        B.R.O.O.K.      <NA>      18:46    21:51       3:05              5 Montreal 2019-02-20
# 549        Rotating Puzzle Collective      <NA>      18:45    21:51       3:06              5 Montreal 2019-02-20
# 550                         Fire Team      <NA>      19:00    22:11       3:11              5 Montreal 2019-02-20

Short Explanation

  1. To perform the task, I first generated the links to all the spreadsheets in the document. To do this:

    • Navigate once to the document
    • Extract the source code
    • Extract the sheet months and URLs (via their gid digits) using regex
  2. Once this is done, loop through the URLs, gathering and binding the tables

Also, for readability purposes, I created a small function called create_table, which returns the table in the proper format. It is mainly the code included in your loop. I only added a safety measure for the number of columns (some of the spreadsheets do not have the team_size field - in those cases I set it to NA_integer_).

Answered Oct 03 '22 by niko