I'm new to Spark and GCP. I've tried to connect to my cluster with
sc <- spark_connect(master = "IP address")
but that obviously doesn't work (among other things, there is no authentication).
How should I do that? Is it possible to connect to it from outside Google Cloud?
There are two issues with connecting to Spark on Dataproc from outside the cluster: configuration and network access. It is generally somewhat difficult and not fully supported, so I would recommend running sparklyr inside the cluster.
Google Cloud Dataproc runs Spark on Hadoop YARN, so you actually need to use yarn-client as the master:
sc <- spark_connect(master = 'yarn-client')
However, you also need a yarn-site.xml in your $SPARK_HOME directory that points Spark at the right hostname for the YARN ResourceManager.
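As a rough sketch, a minimal yarn-site.xml could look like the following. The hostname value is an assumption; Dataproc names the master node <cluster-name>-m by default, so substitute your own cluster's name:

<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical value: replace with your Dataproc master's hostname,
       which defaults to <cluster-name>-m -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>my-cluster-m</value>
  </property>
</configuration>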
While you can open ports to your IP address using firewall rules on your Google Compute Engine network, it's not considered a good security practice. You would also need to configure YARN to use the instance's external IP address or have a way to resolve hostnames on your machine.
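If you decide to open ports anyway, a gcloud sketch like the one below would allow the YARN web UI ports from a single trusted IP. The rule name, ports, and source address are all hypothetical, and note that the Spark driver also uses additional ephemeral ports, so this alone may not be enough:

$ # Hypothetical rule: allow the YARN ResourceManager (8088) and
$ # NodeManager (8042) web UIs from one trusted IP only
$ gcloud compute firewall-rules create allow-yarn-from-my-ip \
    --network default \
    --allow tcp:8088,tcp:8042 \
    --source-ranges 203.0.113.4/32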
sparklyr can be installed and run in the R REPL by SSHing into the master node and running:
$ # Needed for the curl library
$ sudo apt-get install -y libcurl4-openssl-dev
$ R
> install.packages('sparklyr')
> library(sparklyr)
> sc <- spark_connect(master = 'yarn-client')
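Once connected, a quick sanity check is to copy a small local data frame into Spark and count its rows, for example:

> # Copy a built-in data set into Spark and count its rows
> mtcars_tbl <- sdf_copy_to(sc, mtcars, overwrite = TRUE)
> sdf_nrow(mtcars_tbl)
[1] 32
> spark_disconnect(sc)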
I believe RStudio Server supports SOCKS proxies, which can be set up over an SSH tunnel to the master node, but I am not very familiar with RStudio.
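As a sketch, the proxy itself can be opened with gcloud's SSH wrapper; the cluster name, zone, and local port here are assumptions:

$ # Hypothetical: start a SOCKS proxy on local port 1080 through the master node
$ gcloud compute ssh my-cluster-m --zone=us-central1-b -- -D 1080 -N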
I use Apache Zeppelin on Dataproc for R notebooks, but it autoloads SparkR, which I don't think plays well with sparklyr at this time.