GCP Dataproc custom image Python environment

I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and in my initialization script I install Python 3 and some packages from a requirements.txt file, then set the environment variable that forces PySpark to use Python 3.

But when I submit a job on a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages. If I log on to the cluster machine and run the pyspark command, the Anaconda PySpark starts, but if I log on as the root user and run pyspark, I get PySpark with Python 3.5.3. This is very strange.

What I don't understand is: which user is used to create the image, and why do I have a different environment for my user and for the root user? I expect the image to be provisioned as root, so I expect all the packages I installed to be found by the root user.

Thanks in advance.
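(For context, the script described above presumably does something along these lines. This is a hypothetical sketch, since the actual script is not shown in the question; the requirements.txt path and the use of PYSPARK_PYTHON are assumptions.)

#!/bin/bash
# Hypothetical custom image script: install Python 3 plus dependencies and
# point PySpark at it. Paths and variable names are assumptions.

apt-get update
apt-get install -y python3 python3-pip

# The requirements.txt location is assumed; the question does not give a path.
python3 -m pip install -r /tmp/requirements.txt

# Make PySpark pick up Python 3 for every login shell.
echo 'export PYSPARK_PYTHON=python3' > /etc/profile.d/pyspark-python.sh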

asked Jul 12 '19 by Claudio


People also ask

How do I set environment variables in Dataproc?

Client mode (default): in client mode, driver env variables need to be set in spark-env.sh when creating the cluster. You can use --properties spark-env:[NAME]=[VALUE] for that. Executor env variables can be set when submitting the job with gcloud dataproc jobs submit (see the sketch below).
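For example (a sketch; the cluster name, job file and FOO=bar variable are placeholders, not values from the snippet above):

# Driver env variable, written to spark-env.sh at cluster creation time:
gcloud dataproc clusters create my-cluster \
    --properties spark-env:FOO=bar

# Executor env variable, set per job at submission time via the standard
# Spark property spark.executorEnv.[NAME]:
gcloud dataproc jobs submit pyspark job.py \
    --cluster my-cluster \
    --properties spark.executorEnv.FOO=bar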

What languages or tools can you use when developing for Cloud Dataproc?

Users can develop Dataproc jobs in languages that are popular within the Spark and Hadoop ecosystem, such as Java, Scala, Python and R. Google Cloud Dataproc is fully integrated with other Google Cloud Platform services.


1 Answer

Updated answer (Q2 2021)

The customize_conda.sh script is the recommended way of customizing the Conda env for custom images.

If you need more than what the script does, you can read its code and create your own script, but then you usually want to use the absolute paths, e.g. /opt/conda/anaconda/bin/conda, /opt/conda/anaconda/bin/pip, /opt/conda/miniconda3/bin/conda, /opt/conda/miniconda3/bin/pip, to install/uninstall packages for the Anaconda/Miniconda env.
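For instance, a custom image script could install packages into those envs like this (a minimal sketch; the package names are placeholders, and only the interpreter paths come from the paragraph above):

# Install into the Anaconda env using its absolute paths:
/opt/conda/anaconda/bin/conda install -y numpy
/opt/conda/anaconda/bin/pip install requests

# Or into the Miniconda3 env:
/opt/conda/miniconda3/bin/conda install -y numpy
/opt/conda/miniconda3/bin/pip install requests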

Original answer (outdated)

I'd recommend you first read Configure the cluster's Python environment, which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and how to select Python for PySpark jobs.
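For example, Python selection can be done per job at submission time through standard Spark properties (a sketch; the property names are generic Spark config, and the job and cluster names are placeholders):

# Ask PySpark to use python3 for both the driver and the executors:
gcloud dataproc jobs submit pyspark job.py \
    --cluster my-cluster \
    --properties spark.pyspark.python=python3,spark.pyspark.driver.python=python3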

In your case, 1.4 already comes with Miniconda3. Init actions and jobs are executed as root, and /etc/profile.d/effective-python.sh is executed to initialize the Python environment when creating the cluster. But because the custom image script runs first and the optional component activation runs afterwards, Miniconda3 was not yet initialized at custom image build time, so your script actually customized the OS's system Python. Then, during cluster creation, Miniconda3 initialized its own Python, which overrides the OS's system Python.

I found a solution: in your custom image script, add this code at the beginning. It will put you in the same Python environment as the one your jobs use:

# This is /usr/bin/python
which python 

# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh

# Now this is /opt/conda/default/bin/python
which python 

Then you can install packages, e.g.:

conda install <package> -y
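Or, if you have a requirements.txt like the one in the question, install it with pip from the same env (a sketch; the pip path is inferred from the /opt/conda/default/bin/python path above, and the file location is an assumption):

# pip from the activated miniconda3 env (path inferred, not from the answer):
/opt/conda/default/bin/pip install -r /tmp/requirements.txt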
answered Sep 29 '22 by Dagang