I am using the latest version of R on Windows 7.
I would like to run many tests in parallel using RSelenium, so my question is: how can I run a large number of RSelenium tests in parallel? Let's say I would like to run 1000 tests and each test takes 1 hour. Running the tests one by one takes a lot of time (24 tests per day, so about 42 days in total). I know how to use the doParallel and foreach packages to run tests in parallel on my machine (see "Run RSelenium in parallel"), but sometimes this is not enough. I would like to run around 100 tests in parallel. I tried to use Azure Batch for that, but I get a lot of errors on some nodes when starting the Selenium server.
More concretely, I have written a Dockerfile:

    FROM rocker/r-base:latest

    RUN apt-get update \
      && apt-get install -y --no-install-recommends \
        libxml2-dev \
        libcurl4-openssl-dev \
        libssl-dev \
        gnupg2 \
        libfftw3-dev \
        libtiff-dev \
        libx11-dev \
        libcairo2-dev \
        libxt-dev \
        firefox

    #RUN add-apt-repository -y ppa:mozillateam/firefox-next

    ## Install Java
    RUN echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
          | tee /etc/apt/sources.list.d/webupd8team-java.list \
      && echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" \
          | tee -a /etc/apt/sources.list.d/webupd8team-java.list \
      && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys EEA14886 \
      && echo "oracle-java8-installer shared/accepted-oracle-license-v1-1 select true" \
          | /usr/bin/debconf-set-selections \
      && apt-get update \
      && apt-get install -y oracle-java8-installer \
      && update-alternatives --display java \
      && rm -rf /var/lib/apt/lists/* \
      && apt-get clean \
      && R CMD javareconf

    ## make sure Java can be found in rApache and other daemons not looking in R ldpaths
    RUN echo "/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server/" > /etc/ld.so.conf.d/rJava.conf
    RUN /sbin/ldconfig

    # Install the R Packages from CRAN
    RUN Rscript -e 'install.packages(c("Cairo", "Rcpp", "RSelenium", "httr", "rvest", "imager", "RCurl"))'
I have used the doAzureParallel package to execute many scripts in parallel:

    # prepare Azure batch
    setwd("E:/data/R/web_scraping/zk_ba/azure")
    library(doAzureParallel)
    setVerbose(TRUE)
    setAutoDeleteJob(FALSE)
    generateCredentialsConfig("credentials.json")
    setCredentials("credentials.json")
    generateClusterConfig("cluster.json")
    cluster <- makeCluster("cluster.json")
    registerDoAzureParallel(cluster)
    getDoParWorkers()

    opt <- list(wait = FALSE)
    jobId <- foreach(
      i = 1:n_cluster,
      # .packages = c("RSelenium", "imager", "httr", "RCurl", "rvest"),
      # .combine = 'rbind',
      .errorhandling = "pass",
      .options.azure = opt,
      .export = c("metadata", "first_step", "parcele_df", "vlasnici_df",
                  "status_teret_df", "n_cluster")
    ) %dopar% {
      library(RSelenium)
      library(imager)
      library(httr)
      library(RCurl)
      library(rvest)

      #-----------------------------------#
      # START SELENIUM AND PREPARE        #
      #-----------------------------------#
      if (first_step == TRUE) {
        tryCatch({
          rD <<- RSelenium::rsDriver(
            browser = "firefox",
            extraCapabilities = list(
              "moz:firefoxOptions" = list(
                args = list('--headless')
              )
            )
          )
        }, error = function(e) NA)
        driver <<- rD$client
        driver$open()
        driver$navigate("http://www.e-grunt.ba/")
        Sys.sleep(3L)
        ..
    }
but this returns an error on many nodes:
<simpleError in checkError(res): Undefined error in httr call. httr output: Failed to connect to localhost port 4567: Connection refused>
What would be the general advice for situations where we need to run RSelenium in a lot of parallel tests?
RSelenium connects to the Selenium server it sets up on port 4567 by default. As soon as one of the parallel nodes connects to the server via this port, no other node can connect through this port.
A solution is to pass a distinct port argument to rsDriver inside the foreach loop, derived from the loop index i:

    rD <<- RSelenium::rsDriver(
      port = 4567L + as.integer(i),
      browser = "firefox",
      extraCapabilities = list(
        "moz:firefoxOptions" = list(
          args = list('--headless')
        )
      )
    )
You may have to check for clashes of the ports with other applications.
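One way to check for such clashes before starting the server is sketched below in base R. The helper names `task_port` and `port_is_free` are hypothetical and not part of RSelenium. The sketch assumes that a successful TCP connection to localhost means the port is already taken. It only probes the port at the moment of the call, so another process could still grab it afterwards.

```r
# Hypothetical helper: derive a distinct port for foreach task i,
# using the same 4567L + i scheme as in the answer above.
task_port <- function(i, base = 4567L) {
  base + as.integer(i)
}

# Hypothetical helper: returns TRUE when nothing accepts connections on
# `port` on localhost, FALSE when some application is already listening.
port_is_free <- function(port, timeout = 1) {
  tryCatch({
    con <- suppressWarnings(
      socketConnection(host = "localhost", port = port, server = FALSE,
                       blocking = TRUE, open = "r+", timeout = timeout)
    )
    close(con)
    FALSE  # the connection succeeded, so the port is in use
  }, error = function(e) TRUE)  # connection refused: port looks free
}
```

Inside the foreach body you could then call `port_is_free(task_port(i))` first and only pass `port = task_port(i)` to rsDriver when it returns TRUE, failing early with a clear message otherwise.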