I found several tutorials on how to configure IPython Notebook to load the Spark context variable sc using PySpark (like this one: http://ramhiser.com/2015/02/01/configuring-ipython-notebook-support-for-pyspark/). The problem is that now that we are using Jupyter Notebook instead of IPython Notebook, we can't create a setup script to load the Spark context variable the way we did with IPython (the script would be located at ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py).
The question is: what configuration does Jupyter Notebook need so that it executes the script 00-pyspark-setup.py on startup?
EDIT
The original answer should still work, but it is unwieldy; nowadays we use the following method, which relies on PySpark's built-in environment variables:
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
Then just run PySpark as you normally would; with the above variables set, it starts a Jupyter Notebook server instead of a shell:
cd path/to/spark
bin/pyspark --master local[*] # Change to use standalone/mesos/yarn master and add any spark config
If you start a new notebook, you will find Spark already set up for you. You can pass additional options to Jupyter to match your environment, for example:
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --ip='*' --no-browser"
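A quick way to confirm everything is wired up is to run a small job in the first cell of a new notebook (a minimal sanity-check sketch; the sc variable is created by PySpark's shell bootstrap, and on Spark 2.x a spark SparkSession is created as well):
# Run this in the first cell of a new notebook: sc already exists,
# so there is no need to construct a SparkContext yourself.
print(sc.version)
rdd = sc.parallelize(range(100))
print(rdd.sum())  # 4950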
ORIGINAL ANSWER
You can still set things up with the same initial steps, i.e. create a profile using ipython profile create pyspark and place the startup script in $(ipython profile locate pyspark)/startup/.
Next, to make it available in Jupyter notebooks, you have to specify a kernel that uses that profile by creating the file $(ipython locate)/kernels/pyspark/kernel.json. This is what mine looks like:
{
  "display_name": "PySpark",
  "language": "python",
  "argv": [
    "python",
    "-m", "ipykernel",
    "--profile=pyspark",
    "-f", "{connection_file}"
  ],
  "env": {
    "PYSPARK_SUBMIT_ARGS": " --master spark://localhost:7077 --conf spark.driver.memory=20000m --conf spark.executor.memory=20000m"
  }
}
The important bit is in the argv section. The information in the env section is picked up by the startup script I use:
import os
import sys

# Path to the Spark installation; adjust to match your system
spark_home = '/opt/spark/'
os.environ["SPARK_HOME"] = spark_home

# Make the PySpark and Py4J libraries importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# Append the 'pyspark-shell' token to whatever submit args the kernel defined
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Run Spark's interactive shell bootstrap, which creates sc and friends
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
exec(compile(open(filename, "rb").read(), filename, 'exec'))
As you can see, it is quite similar to the one you linked; it just adds the arguments defined in the kernel, plus the pyspark-shell argument, which is needed in recent versions of PySpark.
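Before launching the notebook server, you can check that Jupyter actually sees the new kernel (a small sketch using jupyter_client; depending on your Jupyter version, the kernels directory it searches may be under the Jupyter data path rather than the IPython one):
from jupyter_client.kernelspec import KernelSpecManager

# Lists the kernelspecs Jupyter will offer; 'pyspark' should be among the keys
# if kernel.json was placed in a directory Jupyter searches.
print(KernelSpecManager().find_kernel_specs())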
With this, you can run jupyter notebook, open the main page in your browser, and create notebooks using the new PySpark kernel.
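In such a notebook, the sc variable created by shell.py is already available and should reflect the submit args from kernel.json (a hedged example; the master URL and memory setting below are the ones from my kernelspec and will differ on your cluster):
# sc is created by the startup script via shell.py; check that it picked up
# the settings from PYSPARK_SUBMIT_ARGS defined in kernel.json.
print(sc.master)                                  # e.g. spark://localhost:7077
print(sc.getConf().get("spark.driver.memory"))    # e.g. 20000m
counts = sc.parallelize(["a", "b", "a"]).countByValue()
print(dict(counts))                               # {'a': 2, 'b': 1}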