I am using Jupyter notebooks to try Spark.
In my notebook I try a KMeans:
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from sklearn import datasets
import pandas as pd

spark = SparkSession\
    .builder\
    .appName("PythonKMeansExample")\
    .getOrCreate()

iris = datasets.load_iris()
pd_df = pd.DataFrame(iris['data'])
# KMeans expects a single column of vectors named "features"
spark_df = spark.createDataFrame(
    [(Vectors.dense(row),) for row in pd_df.values], ["features"])
estimator = KMeans(k=3, seed=1)
Everything goes fine until I fit the model:

estimator.fit(spark_df)

And I get an error:
16/08/16 22:39:58 ERROR Executor: Exception in task 0.2 in stage 0.0 (TID 24)
java.io.IOException: Cannot run program "jupyter": error=2, No such file or directory
Caused by: java.io.IOException: error=2, No such file or directory
Where is Spark looking for Jupyter? Why can't it find it when I can run jupyter notebook myself? What should I do?
As the code at https://github.com/apache/spark/blob/master/python/pyspark/context.py#L180 shows:

self.pythonExec = os.environ.get("PYSPARK_PYTHON", 'python')

I think this error is caused by the environment variable PYSPARK_PYTHON. It tells every Spark node which Python executable to run: when PySpark starts, the PYSPARK_PYTHON value from your shell environment is propagated to all of the nodes. In your case it apparently resolves to "jupyter", which the executors cannot run as a Python interpreter.
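You can check this from the notebook before changing anything. A minimal sketch, assuming the SparkSession from the question is still alive (pythonExec is the attribute set on the line cited above):

import os

# What the shell environment passed in, if anything
print(os.environ.get("PYSPARK_PYTHON"))
# What the driver resolved and will ask every executor to run
print(spark.sparkContext.pythonExec)

If either prints "jupyter", that is the culprit.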
It can be solved by pointing the variable at a plain interpreter that exists at the same path, and is the same version, on every node:

export PYSPARK_PYTHON=/usr/bin/python

and then starting:

pyspark
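If you set PYSPARK_PYTHON=jupyter when launching the notebook, note that the variable meant for that is PYSPARK_DRIVER_PYTHON; PYSPARK_PYTHON should always point at a plain interpreter for the workers. You can also apply the fix from inside Python. A sketch, assuming it runs before any SparkSession is created, since the variable is only read at startup:

import os

# Must be set before the SparkContext starts; the workers exec this binary
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python"

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PythonKMeansExample").getOrCreate()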
If the Python version differs between your local machine and the nodes of the cluster, you will hit a version-conflict error instead: the interactive Python you work in must be the same version as the Python on the other nodes of the cluster.
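A quick way to compare the two, as a sketch run from the notebook (the tiny map job is only there to execute code on an executor):

import sys

# Version on the driver
print(sys.version)
# Version on an executor, fetched by running a one-element job
print(spark.sparkContext.range(1).map(lambda _: sys.version).first())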