I created a Dataproc cluster using the following command:
gcloud dataproc clusters create datascience \
    --initialization-actions \
    gs://dataproc-initialization-actions/jupyter/jupyter.sh
However, when I submit my PySpark job, I get the following error:
Exception: Python in worker has different version 3.4 than that in driver 3.7, PySpark cannot run with different minor versions.Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
Any thoughts?
This is due to a difference in Python versions between the master and the workers. By default, the Jupyter initialization action installs the latest version of Miniconda on the master, which uses Python 3.7. However, the workers are still using the default Python 3.6.
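You can confirm this by SSHing into the master and a worker and comparing interpreters. A minimal check (the two environment variables are the ones named in the error message, and are often unset at this point):

which python && python --version
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON PYSPARK_DRIVER_PYTHON=$PYSPARK_DRIVER_PYTHON"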
Solution: specify the Miniconda version when creating the cluster, i.e. install Python 3.6 on the master node (Miniconda 4.3.30 ships Python 3.6, matching the workers), combined with the Jupyter init action from the question:

gcloud dataproc clusters create example-cluster \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --metadata=MINICONDA_VERSION=4.3.30
Note:

UPDATE THE SPARK ENVIRONMENT TO USE PYTHON 3.7:

Open a new terminal and type the following command:

export PYSPARK_PYTHON=python3.7

This will ensure that the worker nodes use Python 3.7 (the same as the driver) and not the default Python 3.4.
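Keep in mind that export only affects the current shell session. To make the setting persist across jobs, you could append it to Spark's environment file on each node; a sketch, assuming the usual /etc/spark/conf/spark-env.sh location (adjust for your image):

echo 'export PYSPARK_PYTHON=python3.7' | sudo tee -a /etc/spark/conf/spark-env.sh
echo 'export PYSPARK_DRIVER_PYTHON=python3.7' | sudo tee -a /etc/spark/conf/spark-env.sh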
DEPENDING ON WHICH PYTHON VERSIONS YOU HAVE, YOU MAY NEED TO INSTALL OR UPDATE ANACONDA:

(To install, see: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart)
Make sure you have Anaconda 4.1.0 or higher. Open a new terminal and check your conda version:

conda --version

If you are below Anaconda 4.1.0, type:

conda update conda

Next, check whether you have nb_conda_kernels by typing:

conda list

If you don't see nb_conda_kernels, install it:

conda install nb_conda_kernels
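If the conda list output is long, you can filter it to confirm the install:

conda list | grep nb_conda_kernels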
Create a Python 3.6 environment:

conda create -n py36 python=3.6 ipykernel

py36 is the name of the environment. You could literally name it anything you want.

Alternatively, if you are using Python 3 and want a separate Python 2 environment, you could type the following:

conda create -n py27 python=2.7 ipykernel

py27 is the name of the environment. It uses Python 2.7.
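Before wiring anything into Spark, you can sanity-check that a new environment uses the interpreter you expect (env name from the step above; source activate is the pre-4.4 conda syntax):

source activate py36
python --version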
Finally, type pyspark. You should see the new environments appearing.

We fixed it now -- thanks for the intermediate workaround @brotich. Check out the discussion in #300.
PR #306 keeps Python at the same version that was already installed (3.6) and installs packages on all nodes, ensuring that the master and worker Python environments stay identical.
As a side effect, you can now choose your Python version by passing an argument to the conda init action, e.g. --metadata 'CONDA_PACKAGES="python==3.5"'.
PR #311 pins Miniconda to a particular version (currently 4.5.4), so we avoid issues like this in the future. You can use --metadata 'MINICONDA_VERSION=latest' to get the old behavior of always downloading the latest Miniconda.
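For reference, here is a sketch of how those metadata keys fit into the create command from the question (assuming the Jupyter init action forwards them to the conda setup, as discussed in #300):

gcloud dataproc clusters create example-cluster \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --metadata 'CONDA_PACKAGES="python==3.5"'

gcloud dataproc clusters create example-cluster \
    --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
    --metadata 'MINICONDA_VERSION=latest'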