 

Connecting an IPython notebook to a Spark master running on a different machine

I don't know if this has already been answered on SO, but I couldn't find a solution to my problem.

I have an IPython notebook running in a Docker container on Google Container Engine; the container is based on the jupyter/all-spark-notebook image.

I also have a Spark cluster created with Google Cloud Dataproc.

The Spark master and the notebook run on different VMs, but in the same region and zone.

My problem is that I'm trying to connect to the Spark master from the IPython notebook, but without success. I use this snippet of code in my Python notebook:

import pyspark
conf = pyspark.SparkConf()
conf.setMaster("spark://<spark-master-ip or spark-master-hostname>:7077")

I just started working with Spark, so I'm sure I'm missing something (authentication, security, ...).

What I have found so far is how to connect a local browser over an SSH tunnel.

Has somebody already done this kind of setup?

Thank you in advance

asked Feb 25 '16 by med

1 Answer

Dataproc runs Spark on YARN, so you need to set the master to 'yarn-client'. You also need to point Spark at your YARN ResourceManager, which requires an under-documented SparkConf -> Hadoop Configuration conversion. You also have to tell Spark about HDFS on the cluster, so it can stage resources for YARN. You could use Google Cloud Storage instead of HDFS, if you baked the Google Cloud Storage Connector for Hadoop into your image.

Try:

import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')

# 'spark.hadoop.foo.bar' sets the key 'foo.bar' in the Hadoop Configuration.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')
conf.set('spark.hadoop.fs.default.name', 'hdfs://<spark-master-hostname>/')

sc = pyspark.SparkContext(conf=conf)
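
If you went the Cloud Storage route instead of HDFS, the same spark.hadoop.* trick can point the default filesystem at a GCS bucket. This is only a sketch: it assumes the GCS connector jar is already on the notebook's classpath, and '<your-project-id>' and '<your-bucket>' are placeholders you would fill in.

import pyspark
conf = pyspark.SparkConf()
conf.setMaster('yarn-client')
conf.setAppName('My Jupyter Notebook')

# Same ResourceManager setting as above.
conf.set('spark.hadoop.yarn.resourcemanager.address', '<spark-master-hostname>')

# Use a GCS bucket as the default filesystem instead of HDFS.
# Assumes the GCS connector is baked into the image, as mentioned above.
conf.set('spark.hadoop.fs.gs.impl',
         'com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem')
conf.set('spark.hadoop.fs.gs.project.id', '<your-project-id>')
conf.set('spark.hadoop.fs.default.name', 'gs://<your-bucket>/')

sc = pyspark.SparkContext(conf=conf)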

For a more permanent config, you could bake these settings into a local core-site.xml file as described here, place it in a local directory, and set HADOOP_CONF_DIR to that directory in your environment.
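
A minimal sketch of that from inside the notebook, assuming the notebook user can write to a local directory (the path below is hypothetical) and that the environment variable is set before the SparkContext is created:

import os

# Hypothetical local directory for Hadoop client config files.
conf_dir = '/home/jovyan/hadoop-conf'
os.makedirs(conf_dir, exist_ok=True)

# Minimal core-site.xml carrying the same default-filesystem setting as above.
core_site = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://<spark-master-hostname>/</value>
  </property>
</configuration>
"""
with open(os.path.join(conf_dir, 'core-site.xml'), 'w') as f:
    f.write(core_site)

# Must be set before the SparkContext (and its JVM) starts in this kernel.
os.environ['HADOOP_CONF_DIR'] = conf_dir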

It's also worth noting that while being in the same zone matters for performance, it is being in the same network, and allowing TCP between internal IP addresses on that network, that lets your VMs communicate. If you are using the default network, the default-allow-internal firewall rule should be sufficient.

Hope that helps.

answered by Patrick Clay