
Run Selenium parallel test on Azure batch


I am using the latest version of R on Windows 7.

I would like to run many tests in parallel using RSelenium, so my question is:

  • What is the recommended way to run many RSelenium tests?

Let's say I would like to run 1000 tests and each test takes 1 hour. Running the tests one by one takes a lot of time (24 tests per day, so about 42 days in total). I know how to use the doParallel and foreach packages to run tests in parallel on my machine (see: Run RSelenium in parallel), but sometimes this is not enough. I would like to run around 100 tests in parallel. I tried to use Azure Batch for that, but I get lots of errors on some nodes when starting the Selenium server.

More concretely, I have written this Dockerfile:
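As a rough check of these numbers, here is a minimal sketch in R. It assumes the tests are independent, that each one takes exactly one hour, and that 100 workers share the load evenly; the variable names are only illustrative.

```r
# Back-of-the-envelope estimate; assumes independent tests of exactly 1 hour each.
n_tests <- 1000
hours_per_test <- 1

# Sequential: one test at a time, around the clock.
sequential_days <- n_tests * hours_per_test / 24

# Parallel: 100 workers, each handling an equal share of the tests.
n_workers <- 100L
parallel_hours <- as.integer(ceiling(n_tests / n_workers) * hours_per_test)

cat(sprintf("Sequential: %.1f days\n", sequential_days))
cat(sprintf("Parallel (%d workers): %d hours\n", n_workers, parallel_hours))
```

Under those assumptions the sequential run needs roughly 42 days, while 100 workers would finish in about 10 hours, ignoring cluster start-up time and failed nodes.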

    FROM rocker/r-base:latest

    RUN apt-get update \
      && apt-get install -y --no-install-recommends \
        libxml2-dev \
        libcurl4-openssl-dev \
        libssl-dev \
        gnupg2 \
        libfftw3-dev \
        libtiff-dev \
        libx11-dev \
        libcairo2-dev \
        libxt-dev \
        firefox

    #RUN add-apt-repository -y ppa:mozillateam/firefox-next

    ## Install Java
    RUN echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
          | tee /etc/apt/sources.list.d/webupd8team-java.list \
      && echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
          | tee -a /etc/apt/sources.list.d/webupd8team-java.list \
      && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 \
      && echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" \
          | /usr/bin/debconf-set-selections \
      && apt-get update \
      && apt-get install -y oracle-java8-installer \
      && update-alternatives --display java \
      && rm -rf /var/lib/apt/lists/* \
      && apt-get clean \
      && R CMD javareconf

    ## make sure Java can be found in rApache and other daemons not looking in R ldpaths
    RUN echo "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/" > /etc/ld.so.conf.d/rJava.conf
    RUN /sbin/ldconfig

    # Install the R Packages from CRAN
    RUN Rscript -e 'install.packages(c("Cairo", "Rcpp", "RSelenium", "httr", "rvest", "imager", "RCurl"))'

I have used the doAzureParallel package to execute many scripts in parallel:

    # prepare Azure batch
    setwd("E:/data/R/web_scraping/zk_ba/azure")
    library(doAzureParallel)

    setVerbose(TRUE)
    setAutoDeleteJob(FALSE)
    generateCredentialsConfig("credentials.json")
    setCredentials("credentials.json")
    generateClusterConfig("cluster.json")
    cluster <- makeCluster("cluster.json")
    registerDoAzureParallel(cluster)
    getDoParWorkers()
    opt <- list(wait = FALSE)

    jobId <- foreach(
      i = 1:n_cluster,
      # .packages = c("RSelenium", "imager", "httr", "RCurl", "rvest"),
      # .combine = 'rbind',
      .errorhandling = "pass",
      .options.azure = opt,
      .export = c("metadata", "first_step", "parcele_df", "vlasnici_df", "status_teret_df", "n_cluster")
    ) %dopar% {

      library(RSelenium)
      library(imager)
      library(httr)
      library(RCurl)
      library(rvest)

      #-----------------------------------#
      #    START SELENIUM AND PREPARE     #
      #-----------------------------------#

      if (first_step == TRUE) {
        tryCatch({
          rD <<- RSelenium::rsDriver(
            browser = "firefox",
            extraCapabilities = list(
              "moz:firefoxOptions" = list(
                args = list('--headless')
              )
            )
          )
        }, error = function(e) NA)
        driver <<- rD$client
        driver$open()
        driver$navigate("http://www.e-grunt.ba/")
        Sys.sleep(3L)
    ..
    }

but this returns an error on many nodes:

<simpleError in checkError(res): Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused> 

What would be the general advice for situations where we need to run lots of RSelenium tests in parallel?

Mislav asked Nov 26 '18 11:11



1 Answer

By default, rsDriver starts its Selenium server on port 4567 and connects to it there. As soon as one of the parallel nodes has claimed this port, no other node can connect through it.

A solution is to pass a distinct port to each rsDriver call in the foreach loop, for example by offsetting it with the loop index:

    rD <<- RSelenium::rsDriver(
      port = 4567L + as.integer(i),  # unique port per foreach iteration
      browser = "firefox",
      extraCapabilities = list(
        "moz:firefoxOptions" = list(
          args = list('--headless')
        )
      )
    )

You may have to check for clashes of the ports with other applications.
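One way to check for such clashes is to probe the port before starting the server. The sketch below is only an illustration: `worker_port` and `port_is_free` are hypothetical helpers, not part of RSelenium, and the probe assumes R 4.0.0 or later, where base R provides `serverSocket()`. The probe only tells you the port was free at that moment; another process could still take it before the Selenium server starts.

```r
# Hypothetical helpers for illustration; they are not part of RSelenium.
base_port <- 4567L

# Map a foreach index (or a vector of indices) to distinct ports,
# failing early if a port would fall outside the valid TCP range.
worker_port <- function(i, base = base_port) {
  port <- base + as.integer(i)
  stopifnot(all(port > 0L), all(port <= 65535L))
  port
}

# Probe whether a port can be bound locally by briefly opening a
# listening socket on it (assumes R >= 4.0.0 for serverSocket()).
port_is_free <- function(port) {
  con <- tryCatch(
    suppressWarnings(serverSocket(port)),
    error = function(e) NULL
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# Example: every index from 1 to 100 gets its own port.
ports <- worker_port(1:100)
```

Inside the foreach body you could then compute `port <- worker_port(i)`, stop with a clear message if `port_is_free(port)` is `FALSE`, and otherwise pass `port = port` to rsDriver.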

thorepet answered Oct 11 '22 18:10
