Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Connect to spark cluster from local jupyter notebook

I try to connect to remote spark master from notebook on my local machine.

When I try creating sparkContext

sc = pyspark.SparkContext(master = "spark://remote-spark-master-hostname:7077", 
                          appName="jupyter notebook_test"),

I get following exception:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
    134         try:
    135             self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
--> 136                           conf, jsc, profiler_cls)
    137         except:
    138             # If an error occurs, clean up in order to allow future SparkContext creation:

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, jsc, profiler_cls)
    196 
    197         # Create the Java SparkContext through Py4J
--> 198         self._jsc = jsc or self._initialize_context(self._conf._jconf)
    199         # Reset the SparkConf to the one actually used by the SparkContext in JVM.
    200         self._conf = SparkConf(_jconf=self._jsc.sc().conf())

/opt/.venv/lib/python3.7/site-packages/pyspark/context.py in _initialize_context(self, jconf)
    304         Initialize SparkContext in function to allow subclass specific initialization
    305         """
--> 306         return self._jvm.JavaSparkContext(jconf)
    307 
    308     @classmethod

/opt/.venv/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1523         answer = self._gateway_client.send_command(command)
   1524         return_value = get_return_value(
-> 1525             answer, self._gateway_client, None, self._fqn)
   1526 
   1527         for temp_arg in temp_args:

/opt/.venv/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:516)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)

At the same time, I can create spark context using the same interpreter in interactive mode.

What I should do to connect to remote spark master from my local jupyter notebook?

like image 634
Grigory Skvortsov Avatar asked Apr 27 '20 11:04

Grigory Skvortsov


People also ask

Can you use Spark in Jupyter notebook?

Jupyter also supports Big data tools such as Apache Spark for data analytics needs. (Read our comprehensive intro to Jupyter Notebooks.)

How do I access PySpark in Jupyter?

There are two ways to get PySpark available in a Jupyter Notebook: Configure PySpark driver to use Jupyter Notebook: running pyspark will automatically open a Jupyter Notebook. Load a regular Jupyter Notebook and load PySpark using findSpark package.

How do I connect to a Spark cluster?

Connecting an Application to the Cluster To run an application on the Spark cluster, simply pass the spark://IP:PORT URL of the master as to the SparkContext constructor. You can also pass an option --total-executor-cores <numCores> to control the number of cores that spark-shell uses on the cluster.


1 Answers

I solved my problem using @HristoIliev advice. In my case, PYSPARK_PYTHON was not set inside the jupyter environment. Simple solution:

import os
os.environ["PYSPARK_PYTHON"] = '/opt/.venv/bin/python'
os.environ["SPARK_HOME"] = '/opt/spark'

Also you can use findspark for this, but I didn't test it.

like image 57
Grigory Skvortsov Avatar answered Oct 10 '22 00:10

Grigory Skvortsov