I'm trying to use jupyter-notebook (v4.2.2) remotely on a Spark cluster (v2.0), but when I run the following command it does not run on Spark but only runs locally:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --master spark://**spark_master_hostname**:7077
When I run pyspark alone with the same --master argument, the process shows up in "Running Applications" for the Spark cluster just fine:
pyspark --master spark://**spark_master_hostname**:7077
It's almost as if pyspark is not being run at all in the former case. Is there something wrong with the first command that prevents Jupyter from running on the Spark cluster, or is there a better way of running notebooks on a Spark cluster?
It looks like you want to load the IPython shell, not the IPython notebook, and use PySpark through the command line? IMO the Jupyter UI is a more convenient way to work with notebooks.
You can run the Jupyter server:
jupyter notebook
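If the notebook server runs on a remote machine, you will probably also want the flags already used in the question, so it does not try to open a browser and listens on a known port (7777 here is just the port from the question):
jupyter notebook --no-browser --port=7777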
then, using the Jupyter UI, start a new Python 2 kernel. In the opened notebook, create a SparkContext with a configuration pointing to your Spark cluster:
from pyspark import SparkContext, SparkConf

conf = SparkConf()
conf.setMaster('spark://**spark_master_hostname**:7077')  # the standalone master from the question
conf.setAppName('some-app-name')
sc = SparkContext(conf=conf)  # registers the app with the cluster
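Since the question mentions Spark 2.0, the same setup can also be written with the SparkSession builder API; here is a minimal sketch, assuming the same master URL and a placeholder app name:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('spark://**spark_master_hostname**:7077') \
    .appName('some-app-name') \
    .getOrCreate()
sc = spark.sparkContext  # the underlying SparkContext, if you prefer the RDD API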
Now you have a PySpark application started on the Spark cluster, and you can interact with it via the created SparkContext, e.g.:
def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

result = sc.parallelize(range(1000)).map(mod).take(10)  # take() returns the first 10 results as a plain list
print(result)
The map above is computed remotely on the cluster; take(10) only brings the first ten results back to the driver, which is what gets printed in the notebook.
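When you are finished, stopping the context releases the executors and removes the application from the master's "Running Applications" list; a minimal sketch:

print(sc.master)  # should show spark://**spark_master_hostname**:7077
sc.stop()         # frees the cluster resources held by this notebook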