
Jupyter Notebook only runs locally on Spark

I'm trying to use jupyter-notebook (v4.2.2) against a remote Spark cluster (v2.0), but when I run the following command it does not run on the Spark cluster, only locally:

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --master spark://**spark_master_hostname**:7077

When I run pyspark alone with the same --master argument, the process shows up under "Running Applications" for the Spark cluster just fine.

pyspark --master spark://**spark_master_hostname**:7077

It's almost as if pyspark is not being run at all in the former case. Is there something wrong with the first command that prevents Jupyter from running on the Spark cluster, or is there a better way of running notebooks on a Spark cluster?

asked Sep 16 '16 by user6837711

1 Answer

It looks like you want to load the IPython shell, not the IPython notebook, and use PySpark from the command line.

IMO the Jupyter UI is a more convenient way to work with notebooks.

You can start a Jupyter server:

jupyter notebook

then (using the Jupyter UI) start a new Python 2 kernel. In the opened notebook, create a SparkContext with a configuration pointing to your Spark cluster:

from pyspark import SparkContext, SparkConf

# Point the driver at the standalone Spark master instead of running locally
conf = SparkConf()
conf.setMaster('spark://**spark_master_hostname**:7077')
conf.setAppName('some-app-name')
sc = SparkContext(conf=conf)
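
Note that the snippet above assumes the notebook's Python kernel can import pyspark at all. If you started a plain jupyter notebook (instead of going through the pyspark launcher), one way to make the package importable is the findspark helper; this is an assumption on my part rather than part of the original setup, and it presumes SPARK_HOME points at your Spark installation:

# Optional setup cell, only needed if "import pyspark" fails in the notebook kernel.
# Assumes the findspark package is installed (pip install findspark)
# and that SPARK_HOME points at your Spark 2.0 installation.
import findspark
findspark.init()  # prepends pyspark and py4j from SPARK_HOME to sys.path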

Now you have a PySpark application running on the Spark cluster, and you can interact with it via the created SparkContext, e.g.:

def mod(x):
    # pair each value with its remainder modulo 2
    import numpy as np
    return (x, np.mod(x, 2))

# take(10) collects the first ten results back to the driver as a plain list
result = sc.parallelize(range(1000)).map(mod).take(10)
print(result)

The code above is computed remotely on the cluster. Once the SparkContext is created, the application (some-app-name) should also show up under "Running Applications" in the Spark master UI, just like your plain pyspark run.
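
As a small follow-up not in the original answer: when you are finished, stopping the context releases the application's executors on the cluster.

# Shut down the application and free its resources on the cluster
sc.stop()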

answered Oct 10 '22 by Artur I