
Connect sparklyr to remote spark connection

I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you go to add a new connection in the sparklyr UI tab in RStudio and choose "Cluster", it says that you have to be running on the cluster, or have a high-bandwidth connection to the cluster.

Can anyone shed light on how to create that kind of connection? I am not sure how to create a reproducible example of this, but in general what I would like to do is:

library(sparklyr)

sc <- spark_connect(master     = "spark://ip-[MY_PRIVATE_IP]:7077",
                    spark_home = "/home/ubuntu/spark-2.0.0",
                    version    = "2.0.0")

from a remote server. I understand that there will be latency, especially when passing data between the two machines. I also understand that it would be better to have RStudio Server on the actual cluster, but that is not always possible, and I am looking for a sparklyr option for interacting between my server and my desktop RStudio session. Thanks.

asked Sep 30 '16 by Jim Crozier



2 Answers

As of sparklyr version 0.4, connecting from the RStudio desktop to a remote Spark cluster is unsupported. Instead, as you mention, the recommended approach is to run RStudio Server within the Spark cluster.

That said, the livy branch of sparklyr is exploring integration with Apache Livy, which would enable the RStudio desktop to connect to a remote Spark cluster through Livy; a sketch of what that could look like follows.
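For illustration, here is a minimal sketch of such a connection, assuming the Livy integration is exposed as a method = "livy" argument to spark_connect() and that a Livy server is reachable at livy.example.com:8998 (both the method name and the host are placeholders for an in-progress feature, not a confirmed API):

library(sparklyr)

# Connect through a Livy server instead of directly to the Spark master.
# The Livy endpoint below is a placeholder; substitute your own host/port.
sc <- spark_connect(master  = "http://livy.example.com:8998",
                    method  = "livy",
                    version = "2.0.0")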

answered Sep 21 '22 by Javier Luraschi


Using a more recent version of sparklyr (0.9.2, for example), it is possible to connect to a remote Spark cluster.

Here is an example of connecting to a standalone Spark cluster running version 2.3.1. See the Master URLs documentation for other master URL schemes.

# install.packages("sparklyr")
library(sparklyr)

# The same Spark version running on the cluster must also be installed
# locally, on the driver machine where RStudio is running.
spark_v <- "2.3.1"
cat("Installing Spark in the directory:", spark_install_dir(), "\n")
spark_install(version = spark_v)

sc <- spark_connect(spark_home = spark_install_find(version = spark_v)$sparkVersionDir,
                    master     = "spark://ip-[MY_PRIVATE_IP]:7077")

sc$master
# "spark://ip-[MY_PRIVATE_IP]:7077"

I've written a post on this topic.

answered Sep 23 '22 by Romain