Is there a way to connect Apache Toree to a remote Spark cluster? I see the common command is
jupyter toree install --spark_home=/usr/local/bin/apache-spark/
How can I use Spark on a remote server without having to install it locally?
Apache Toree is a kernel for the Jupyter Notebook platform providing interactive access to Apache Spark. It was developed using the IPython messaging protocol and 0MQ, and despite the protocol's name, Apache Toree currently exposes the Spark programming model in Scala, Python, and R.
spark-kernel (homepage): The Spark Kernel enables remote applications to dynamically interact with Apache Spark. It serves as a remote Spark Shell that uses the IPython message protocol to provide a common entrypoint for applications (including IPython itself).
There is indeed a way to get Toree to connect to a remote Spark cluster.
The easiest way I've found is to clone the existing Toree Scala/Python kernel and create a new Toree Scala/Python Remote kernel from it. That way you keep the choice of running locally or remotely.
Steps:

1. Make a copy of the existing kernel. On my particular Toree install, the kernels live under /usr/local/share/jupyter/kernels/, so I ran:
cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/

2. Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the requisite Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <master-url> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.

3. Restart Jupyter (a quick way to verify the new kernel was picked up is sketched right after these steps).
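To confirm Jupyter actually sees the cloned kernel after the restart, you can list the installed kernelspecs; the apache_toree_scala_remote directory created in step 1 should show up (exact names depend on your install):

# list every kernelspec Jupyter knows about; the cloned remote kernel should appear in the output
jupyter kernelspec list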
My kernel.json file looks like this:
{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
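As an aside, depending on your Toree version you may not need the manual copy at all: the installer accepts a --spark_opts argument (and a --kernel_name to keep it separate from your local kernel), so something like the following may register a remote-pointing kernel in one step. Check jupyter toree install --help for the flags your version actually supports; the master URL below is just the one from my example.

# hypothetical one-shot install of a remote-pointing kernel; verify these flags against your Toree version
jupyter toree install --spark_home=/opt/spark \
  --kernel_name='Scala Remote' \
  --spark_opts='--master spark://192.168.0.255:7077 --deploy-mode client'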