 

Connecting Apache Toree to a remote Spark cluster

Is there a way to connect Apache Toree to a remote Spark cluster? The install command I usually see is

jupyter toree install --spark_home=/usr/local/bin/apache-spark/

How can I use Spark on a remote server without having to install it locally?

Asked by yunli.tang, Feb 17 '17 18:02




1 Answer

There is indeed a way of getting Toree to connect to a remote Spark cluster.

The easiest way I've found is to clone the existing Toree Scala/Python kernel into a new Toree Scala/Python Remote kernel. That way you can choose between running locally and running remotely.

Steps:

  1. Make a copy of the existing kernel. On my Toree install, the kernels are located under /usr/local/share/jupyter/kernels/, so I ran the following command:
    cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/

  2. Edit the new kernel.json file in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the required Spark options to the __TOREE_SPARK_OPTS__ variable. Technically, only --master <url> is required, but you can also add --num-executors, --executor-memory, etc. to the variable.

  3. Restart Jupyter.
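
Steps 1 and 2 can also be scripted. The following is a minimal Python sketch, not part of Toree itself: the function name clone_toree_kernel and its parameters are hypothetical, and it assumes the source kernel directory contains a kernel.json whose argv refers to that directory by absolute path. It copies the directory, repoints argv at the copy, and sets __TOREE_SPARK_OPTS__.

```python
import json
import shutil
from pathlib import Path


def clone_toree_kernel(kernels_dir,
                       source="apache_toree_scala",
                       target="apache_toree_scala_remote",
                       spark_opts="--master spark://192.168.0.255:7077",
                       display_name="Toree - Scala Remote"):
    """Copy an existing Toree kernel directory and point the copy at a remote master.

    All names and defaults here are illustrative assumptions taken from the
    paths and values shown in this answer; adjust them to your install.
    """
    src = Path(kernels_dir) / source
    dst = Path(kernels_dir) / target

    # Step 1: copy the whole kernel directory (raises if the target exists).
    shutil.copytree(src, dst)

    # Step 2: edit the copied kernel.json; the original is left untouched.
    spec_path = dst / "kernel.json"
    spec = json.loads(spec_path.read_text())
    spec["display_name"] = display_name
    # Repoint any argv entry that still refers to the source directory.
    spec["argv"] = [arg.replace(str(src), str(dst)) for arg in spec["argv"]]
    spec.setdefault("env", {})["__TOREE_SPARK_OPTS__"] = spark_opts
    spec_path.write_text(json.dumps(spec, indent=2))
    return spec_path


# Example call (commented out so importing this file changes nothing):
# clone_toree_kernel("/usr/local/share/jupyter/kernels")
```

Writing under /usr/local/share usually needs elevated permissions, and Jupyter still has to be restarted afterwards (step 3) before the new kernel shows up.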

My kernel.json file looks like this:

{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
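
Before restarting Jupyter, it can help to confirm that the edited file still parses as JSON and that the master is reachable from the notebook host. The sketch below is only an illustration under stated assumptions, not a Toree feature: master_from_kernel_spec and master_is_reachable are hypothetical helper names, and the reachability check handles spark://host:port URLs only.

```python
import json
import shlex
import socket
from urllib.parse import urlparse


def master_from_kernel_spec(kernel_json_text):
    """Return the value given to --master in __TOREE_SPARK_OPTS__, or None."""
    spec = json.loads(kernel_json_text)  # raises ValueError on malformed JSON
    opts = shlex.split(spec.get("env", {}).get("__TOREE_SPARK_OPTS__", ""))
    for i, opt in enumerate(opts):
        # Accept both "--master URL" and "--master=URL".
        if opt == "--master" and i + 1 < len(opts):
            return opts[i + 1]
        if opt.startswith("--master="):
            return opt.split("=", 1)[1]
    return None


def master_is_reachable(master_url, timeout=3.0):
    """Attempt a plain TCP connection to a spark://host:port master."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or parsed.hostname is None or parsed.port is None:
        raise ValueError(f"not a spark://host:port URL: {master_url}")
    try:
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        return False


# Example use (commented out; the path is the one from this answer):
# text = open("/usr/local/share/jupyter/kernels/apache_toree_scala_remote/kernel.json").read()
# master = master_from_kernel_spec(text)
# print(master, master_is_reachable(master))
```

A successful connection only shows that something is listening on that host and port; it does not prove that the cluster will accept the application or that the Spark versions match.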
Answered by JamCon, Oct 21 '22 11:10