I'm trying to use jupyter-notebook (v4.2.2) remotely on a Spark cluster (v2.0), but when I run the following command it does not run on Spark but only runs locally:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --master spark://**spark_master_hostname**:7077
When I run pyspark alone with the same --master argument, the process shows up in "Running Applications" for the Spark cluster just fine:
pyspark --master spark://**spark_master_hostname**:7077
It's almost as if pyspark is not being run at all in the former case. Is there something wrong with the first command that prevents Jupyter from running on the Spark cluster, or is there a better way of running notebooks on a Spark cluster?
It looks like you want to load the IPython shell, not the IPython notebook, and use PySpark through the command line? IMO the Jupyter UI is a more convenient way to work with notebooks.
You can run the Jupyter server:
jupyter notebook
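If the notebook server runs on a remote machine, you will probably also want the flags already used in the question, so it does not try to open a browser and listens on a known port (7777 here is just the port from the question):
jupyter notebook --no-browser --port=7777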
then, using the Jupyter UI, start a new Python 2 kernel. In the opened notebook, create a SparkContext with a configuration pointing to your Spark cluster:
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.setMaster('spark://**spark_master_hostname**:7077')  # the standalone master from the question
conf.setAppName('some-app-name')
sc = SparkContext(conf=conf)  # registers the app with the cluster
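Since the question mentions Spark 2.0, the same setup can also be written with the SparkSession builder API; here is a minimal sketch, assuming the same master URL and a placeholder app name:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('spark://**spark_master_hostname**:7077') \
    .appName('some-app-name') \
    .getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, if you prefer the RDD API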
Now you have a PySpark application started on the Spark cluster, and you can interact with it via the created SparkContext, e.g.:
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

result = sc.parallelize(range(1000)).map(mod).take(10)  # take() returns the first 10 results as a plain list
print(result)
The map above is computed remotely on the cluster; take(10) only brings the first ten results back to the driver, which is what gets printed in the notebook.
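When you are finished, stopping the context releases the executors and removes the application from the master's "Running Applications" list; a minimal sketch:

print(sc.master)  # should show spark://**spark_master_hostname**:7077
sc.stop()         # frees the cluster resources held by this notebook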