 

Environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

I installed pyspark recently and it appears to be installed correctly. However, when I run the following simple program in Python, I get an error.

>>> from pyspark import SparkContext
>>> sc = SparkContext()
>>> data = range(1, 1000)
>>> rdd = sc.parallelize(data)
>>> rdd.collect()

While running the last line I get an error whose key lines seem to be:

[Stage 0:>                                                          (0 + 0) / 4]
18/01/15 14:36:32 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 123, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I have the following variables in my .bashrc:

export SPARK_HOME=/opt/spark
export PYTHONPATH=$SPARK_HOME/python3

I am using Python 3.

Asked by Akash Kumar on Jan 15 '18

People also ask

How do I set environment variables in PySpark?

Before starting PySpark, you need to set environment variables for the Spark path and the Py4j path. To set them globally, put the exports in your .bashrc file and reload it (for example with source ~/.bashrc) so the variables take effect.
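As an illustration only, here is a minimal sketch of setting the two variables from the error message inside the driver script itself, before the SparkContext is created. The interpreter path /usr/bin/python3 is an assumption; point it at whichever Python your workers actually have.

import os

# Driver and workers must run the same Python; the paths below are assumptions
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"

from pyspark import SparkContext

sc = SparkContext()
# Should no longer raise the version-mismatch exception
print(sc.parallelize(range(1, 1000)).collect()[:5])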

How do I set environment variables in Spark?

Spark properties control most application parameters and can be set using a SparkConf object, or through Java system properties. Environment variables can be used to set per-machine settings, such as the IP address, through the conf/spark-env.sh script on each node.
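For example, a short sketch of the SparkConf route mentioned above; the application name, master URL, and memory value are placeholders, not recommendations:

from pyspark import SparkConf, SparkContext

# Application-level properties set programmatically (values are illustrative)
conf = (SparkConf()
        .setAppName("example-app")
        .setMaster("local[4]")
        .set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)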

What is the PYTHONPATH environment variable?

PYTHONPATH is an environment variable which you can set to add additional directories where Python will look for modules and packages. For most installations, you should not set this variable, since it is not needed for Python to run; Python already knows where to find its standard library.
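A quick way to see the effect (the directory /opt/mylibs is made up): start Python with that directory in PYTHONPATH and inspect the module search path.

import sys

# If launched as:  PYTHONPATH=/opt/mylibs python3 script.py
# then /opt/mylibs appears in sys.path ahead of site-packages
print(sys.path)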

How do I change the Python version used by the workers?

If you only want to change the Python version for the current task, you can start pyspark with the variables set on the command line:

PYSPARK_DRIVER_PYTHON=/home/user1/anaconda2/bin/python PYSPARK_PYTHON=/usr/local/anaconda2/bin/python pyspark --master ..


2 Answers

By the way, if you use PyCharm, you can add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to the run/debug configuration, as in the screenshot below.

[screenshot: PyCharm run/debug configuration with PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON set as environment variables]

Answered by buxizhizhoum on Sep 23 '22

You should set the following environment variables in $SPARK_HOME/conf/spark-env.sh:

export PYSPARK_PYTHON=/usr/bin/python
export PYSPARK_DRIVER_PYTHON=/usr/bin/python

If spark-env.sh doesn't exist, you can rename spark-env.sh.template to spark-env.sh.
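Since the question uses Python 3, those paths would point at a Python 3 interpreter instead. As a minimal sanity check after restarting Spark (a sketch assuming a default SparkContext), compare the version the driver runs against the version a worker task reports:

import sys
from pyspark import SparkContext

sc = SparkContext()
# Driver version comes from this interpreter; worker version is reported from inside a task
print("driver :", sys.version_info[:2])
print("workers:", sc.parallelize([0]).map(lambda _: sys.version_info[:2]).collect())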

Answered by Alex on Sep 22 '22