I would like to connect my local desktop RStudio session to a remote Spark session via sparklyr. When you go to add a new connection in the sparklyr UI tab in RStudio and choose "Cluster", it says that you have to be running on the cluster or have a high-bandwidth connection to it.
Can anyone shed light on how to create that kind of connection? I am not sure how to create a reproducible example of this, but in general what I would like to do is:
library(sparklyr)
sc <- spark_connect(master = "spark://ip-[MY_PRIVATE_IP]:7077",
                    spark_home = "/home/ubuntu/spark-2.0.0",
                    version = "2.0.0")
from a remote server. I understand that there will be latency, especially when passing data between the two machines. I also understand that it would be better to have RStudio Server on the actual cluster, but that is not always possible, and I am looking for a sparklyr option for interacting between the remote server and my desktop RStudio session. Thanks.
In principle you can create a Spark session by specifying the IP address of the remote master, but some managed services block this. For example, the AWS documentation states: "It's not possible to submit a Spark application to a remote Amazon EMR cluster with the following command: SparkConf conf = new SparkConf().setMaster("spark://<master url>:7077")".
Connect sparklyr to Azure Databricks clusters: to establish a sparklyr connection, you can use "databricks" as the connection method in spark_connect(). No additional parameters to spark_connect() are needed, nor is calling spark_install() needed, because Spark is already installed on an Azure Databricks cluster.
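For instance, a minimal sketch, assuming the code runs somewhere a Databricks runtime is available (for example, in RStudio hosted on the cluster):
library(sparklyr)
# "databricks" tells sparklyr to attach to the cluster's existing Spark runtime
sc <- spark_connect(method = "databricks")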
Starting up from RStudio: you can also start SparkR from RStudio, and connect your R program to a Spark cluster from RStudio, the R shell, Rscript, or other R IDEs. To start, make sure SPARK_HOME is set in your environment (you can check it with Sys.getenv()), load the SparkR package, and call sparkR.session().
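A minimal sketch of that sequence, reusing the SPARK_HOME path and master URL from the question as placeholders:
# Point SPARK_HOME at the local Spark installation before loading SparkR
Sys.setenv(SPARK_HOME = "/home/ubuntu/spark-2.0.0")
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))
# Start a session against the remote standalone master
sparkR.session(master = "spark://ip-[MY_PRIVATE_IP]:7077",
               sparkHome = Sys.getenv("SPARK_HOME"))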
Spark Connect introduces a decoupled client-server architecture for Apache Spark that allows remote connectivity to Spark clusters using the DataFrame API and unresolved logical plans as the protocol. The separation between client and server allows Spark and its open ecosystem to be leveraged from everywhere.
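Recent sparklyr releases can speak this protocol through the pysparklyr companion package. A sketch under that assumption, with a placeholder host and the default Spark Connect port 15002:
library(sparklyr)
# "sc://" is the Spark Connect URL scheme; requires the pysparklyr package
sc <- spark_connect(master = "sc://[REMOTE_HOST]:15002",
                    method = "spark_connect",
                    version = "3.5")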
As of sparklyr version 0.4, connecting from the RStudio desktop to a remote Spark cluster is unsupported. Instead, as you mention, the recommended approach is to install RStudio Server within the Spark cluster.
That said, the livy branch of sparklyr is exploring a Livy integration that would enable the RStudio desktop to connect to a remote Spark cluster.
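Livy support has since shipped in released versions of sparklyr (documented as experimental). A minimal sketch, assuming a Livy server reachable on its default port 8998; the hostname is a placeholder:
library(sparklyr)
# method = "livy" routes the connection through the Livy REST endpoint
sc <- spark_connect(master = "http://[LIVY_SERVER]:8998",
                    method = "livy")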
Using a more recent version of sparklyr (version 0.9.2, for example), it is possible to connect to a remote Spark cluster. Here is an example connecting to a Spark standalone cluster, version 2.3.1.
See Master URLs for other master URL schemes.
# install.packages("sparklyr")
library(sparklyr)

# Install locally (on the driver, where RStudio is running) the same Spark
# version that the remote cluster runs
spark_v <- "2.3.1"
cat("Installing Spark in the directory:", spark_install_dir())
spark_install(version = spark_v)

# Point spark_home at the local installation and master at the remote cluster
sc <- spark_connect(spark_home = spark_install_find(version = spark_v)$sparkVersionDir,
                    master = "spark://ip-[MY_PRIVATE_IP]:7077")

sc$master
# "spark://ip-[MY_PRIVATE_IP]:7077"
I've written a post on this topic.