
How to run custom Python script on Jupyter Notebook launch (to boot Spark)?

I found several tutorials on how to configure IPython Notebook to load the Spark context variable sc using PySpark (like this one: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/).

The problem is that, now that we are using Jupyter Notebook instead of IPython Notebook, we can't create a setup script to load the Spark context variable the way we did with IPython (the script would be located in ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py).

The question is: what configuration does Jupyter Notebook need in order to execute the script 00-pyspark-setup.py on startup?

asked Apr 28 '16 by htaidirt

People also ask

How do I run a Python script on a Jupyter Notebook?

To run a piece of code, click on the cell to select it, then press SHIFT+ENTER or press the play button in the toolbar above. Additionally, the Cell dropdown menu has several options for running cells, including running one cell at a time or running all cells at once.

How do I run a Python script in spark?

Spark provides a command to execute an application file, whether it is written in Scala or Java (packaged as a JAR), Python, or R. The command is: $ spark-submit --master <url> <SCRIPTNAME>.py

How do I run a .py file in spark shell?

Run a PySpark application with spark-submit, passing any .py, .egg, or .zip files with dependencies to the spark-submit command using the --py-files option.
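
For illustration, here is a minimal sketch of the kind of PySpark script you would hand to spark-submit; the file names and the deps.zip archive are hypothetical, and the command in the comment assumes a local master:

# wordcount.py -- a minimal sketch of a spark-submit application.
# Launch with something like:
#   spark-submit --master local[*] --py-files deps.zip wordcount.py
# (deps.zip is a hypothetical archive of extra Python dependencies)
from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("input.txt")                  # hypothetical input file
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()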


1 Answer

EDIT

The original answer should still work, but it is unwieldy; nowadays we use the following method, which relies on PySpark's built-in environment variables:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"

Then just run PySpark directly as you normally would; with the above variables set, it starts Jupyter Notebook rather than a shell:

cd path/to/spark
bin/pyspark --master local[*]  # Change to use standalone/mesos/yarn master and add any spark config

If you start a new notebook you will find Spark set up for you. You can pass other options to Jupyter to match your environment, for example:

export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip='*' --no-browser"

ORIGINAL ANSWER

You can still set things up with the same initial steps, i.e. create a profile using ipython profile create pyspark and place the startup script in $(ipython profile locate pyspark)/startup/.

Next, to make it available in Jupyter notebooks you have to specify a kernel that uses that profile, by creating a file $(ipython locate)/kernels/pyspark/kernel.json. This is what mine looks like:

{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "python",
    "-m", "ipykernel",
    "--profile=pyspark",
    "-f", "{connection_file}"
  ],
  "env": {
    "PYSPARK_SUBMIT_ARGS": " --master spark://localhost:7077 --conf spark.driver.memory=20000m  --conf spark.executor.memory=20000m"
  }
}

The important bit is in the argv section. The information in the env section is picked up by the startup script I use:

import os
import sys

spark_home = '/opt/spark/'
os.environ["SPARK_HOME"] = spark_home
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))

As you can see, it is quite similar to the one you linked, with the addition of the arguments defined in the kernel and of the pyspark-shell argument, which is needed in the latest versions of PySpark.

With this, you can run jupyter notebook, open the main page in a browser, and create notebooks using the new kernel:

(screenshot: creating a new notebook with the PySpark kernel)
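
In a notebook created with this kernel, shell.py has already run, so sc exists and points at the master set in PYSPARK_SUBMIT_ARGS above. A quick check (a sketch; output depends on your cluster):

print(sc.master)                          # spark://localhost:7077
print(sc.parallelize([1, 2, 3]).count())  # 3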

answered Nov 15 '22 by sgvd