
spark 2.0 - java.io.IOException: Cannot run program "jupyter": error=2, No such file or directory


I am using Jupyter notebooks to try Spark.

Once in my notebook, I try a KMeans:

from pyspark.ml.clustering import KMeans
from pyspark.sql           import SparkSession
from sklearn               import datasets
import pandas as pd

spark = SparkSession\
        .builder\
        .appName("PythonKMeansExample")\
        .getOrCreate()

iris       = datasets.load_iris()
pd_df      = pd.DataFrame(iris['data'])
spark_df   = spark.createDataFrame(pd_df, ["features"])
estimator  = KMeans(k=3, seed=1)

Everything goes fine until I fit the model:

estimator.fit(spark_df)

And I get an error:

16/08/16 22:39:58 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 24)
java.io.IOException: Cannot run program "jupyter": error=2, No such file or directory

Caused by: java.io.IOException: error=2, No such file or directory

Where is Spark looking for Jupyter? Why can't it find it if I can use Jupyter notebook? What should I do?

Romain Jouin asked Aug 16 '16 20:08



1 Answer

As the code at https://github.com/apache/spark/blob/master/python/pyspark/context.py#L180 shows:

self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python')

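So I think this error is caused by the PYSPARK_PYTHON environment variable. It names the Python executable to use on each Spark node. When pyspark starts, the value of PYSPARK_PYTHON from the local environment is passed on to all Spark nodes. The error message suggests it is currently set to jupyter, which the executors cannot find. So: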

  1. It can be solved by setting

    export PYSPARK_PYTHON=/usr/bin/python

    where that path points to the same Python version on every node, and then starting:

    pyspark

  2. If the Python versions differ between the local machine and the nodes of the cluster, a different error about a version conflict will occur.

  3. The interactive Python you work in should be the same version as the Python on the other nodes in the cluster.
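
To check what the lookup quoted above would resolve to on a given machine, something like the following sketch may help. The helper name `resolve_pyspark_python` is made up for illustration and is not part of PySpark; it only repeats the `os.environ.get("PYSPARK_PYTHON", 'python')` lookup and then searches the PATH with the standard library's `shutil.which`.

```python
import os
import shutil


def resolve_pyspark_python(env=None):
    """Return (name, path) for the Python executable PYSPARK_PYTHON points to.

    Hypothetical helper, not a PySpark API. It repeats the lookup from
    context.py (PYSPARK_PYTHON, falling back to "python") and then checks
    whether that program can be found on this machine.
    `path` is None when the program cannot be found.
    """
    env = os.environ if env is None else env
    name = env.get("PYSPARK_PYTHON", "python")
    # shutil.which returns None when no matching executable exists.
    return name, shutil.which(name, path=env.get("PATH"))


if __name__ == "__main__":
    name, path = resolve_pyspark_python()
    if path is None:
        print("PYSPARK_PYTHON resolves to %r, which was not found here" % name)
    else:
        print("PYSPARK_PYTHON resolves to %r at %s" % (name, path))
```

This only inspects the machine it runs on, so it would have to be run on each node to compare results. Setting `os.environ["PYSPARK_PYTHON"]` in the notebook before the SparkSession is created should have the same effect as the `export` line, since the variable is read when the context is built.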

fandyst answered Sep 23 '22 16:09
