I found several tutorials on how to configure IPython Notebook to load the Spark context variable sc using PySpark (like this one: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/). The problem is that now that we are using Jupyter Notebook instead of IPython Notebook, we can't create a setup script to load the Spark context variable the way we did with IPython (the script would be located at ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py).
The question is: what configuration does Jupyter Notebook need so that it executes the script 00-pyspark-setup.py on startup?
EDIT
The original answer should still work, but it is unwieldy; nowadays we use the following method, which relies on PySpark's built-in environment variables:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then just run PySpark as you normally would; with the above variables set, it starts a Jupyter Notebook server instead of a shell:
cd path/to/spark
bin/pyspark --master local[*] # Change to use standalone/mesos/yarn master and add any spark config
If you start a new notebook, you will find Spark already set up for you. You can pass additional options to Jupyter to match your environment, for example:
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip='*' --no-browser"
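A quick way to confirm everything is wired up is to run a small job in the first cell of a new notebook (a minimal sanity-check sketch; the sc variable is created by PySpark's shell bootstrap, and on Spark 2.x a spark SparkSession is created as well):
# Run this in the first cell of a new notebook: sc already exists,
# so there is no need to construct a SparkContext yourself.
print(sc.version)
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950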
ORIGINAL ANSWER
You can still set things up with the same initial steps, i.e. create a profile using ipython profile create pyspark and place the startup script in $(ipython profile locate pyspark)/startup/.
Next, to make it available in Jupyter notebooks, you have to specify a kernel that uses that profile by creating the file $(ipython locate)/kernels/pyspark/kernel.json. This is what mine looks like:
{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "python",
    "-m", "ipykernel",
    "--profile=pyspark",
    "-f", "{connection_file}"
  ],
  "env": {
    "PYSPARK_SUBMIT_ARGS": " --master spark://localhost:7077 --conf spark.driver.memory=20000m --conf spark.executor.memory=20000m"
  }
}
The important bit is in the argv section. The information in the env section is picked up by the startup script I use:
import os
import sys

# Path to the Spark installation; adjust to match your system
spark_home = '/opt/spark/'
os.environ["SPARK_HOME"] = spark_home

# Make the PySpark and Py4J libraries importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Append the 'pyspark-shell' token to whatever submit args the kernel defined
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run Spark's interactive shell bootstrap, which creates sc and friends
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
As you can see, it is quite similar to the one you linked; it just adds the arguments defined in the kernel, plus the pyspark-shell argument, which is needed in recent versions of PySpark.
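Before launching the notebook server, you can check that Jupyter actually sees the new kernel (a small sketch using jupyter_client; depending on your Jupyter version, the kernels directory it searches may be under the Jupyter data path rather than the IPython one):
from jupyter_client.kernelspec import KernelSpecManager

# Lists the kernelspecs Jupyter will offer; 'pyspark' should be among the keys
# if kernel.json was placed in a directory Jupyter searches.
print(KernelSpecManager().find_kernel_specs())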
With this, you can run jupyter notebook, open the main page in your browser, and create notebooks using the new PySpark kernel.
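In such a notebook, the sc variable created by shell.py is already available and should reflect the submit args from kernel.json (a hedged example; the master URL and memory setting below are the ones from my kernelspec and will differ on your cluster):
# sc is created by the startup script via shell.py; check that it picked up
# the settings from PYSPARK_SUBMIT_ARGS defined in kernel.json.
print(sc.master)                                  # e.g. spark://localhost:7077
print(sc.getConf().get("spark.driver.memory"))    # e.g. 20000m
counts = sc.parallelize(["a", "b", "a"]).countByValue()
print(dict(counts))                               # {'a': 2, 'b': 1}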