I have an issue with a Dataproc custom image and PySpark. My custom image is based on Dataproc 1.4.1-debian9, and in my initialization script I install Python 3 and some packages from a requirements.txt file, then set the Python 3 environment variable to force PySpark to use Python 3. But when I submit a job on a cluster created with this image (with the single-node flag for simplicity), the job can't find the installed packages.
If I log on to the cluster machine and run the pyspark command, it starts the Anaconda PySpark, but if I log on as root and run pyspark, I get PySpark with Python 3.5.3. This is very strange.
What I don't understand is: which user is used to create the image? Why do I have a different environment for my user and the root user? I expected the image to be provisioned as root, so I expected all the packages I installed to be visible to the root user.
Thanks in advance.
The customize_conda.sh script is the recommended way of customizing Conda env for custom images.
If you need more than the script does, you can read its code and create your own script, but usually you want to use the absolute paths, e.g., /opt/conda/anaconda/bin/conda, /opt/conda/anaconda/bin/pip, /opt/conda/miniconda3/bin/conda, and /opt/conda/miniconda3/bin/pip, to install/uninstall packages for the Anaconda/Miniconda env.
I'd recommend you first read Configure the cluster's Python environment, which gives an overview of Dataproc's Python environment on different image versions, as well as instructions on how to install packages and select the Python interpreter for PySpark jobs.
In your case, 1.4 already comes with miniconda3. Init actions and jobs are executed as root, and /etc/profile.d/effective-python.sh is executed to initialize the Python environment when the cluster is created. However, because the custom image script runs first and the optional component activation runs afterwards, miniconda3 was not yet initialized at custom image build time. Your script therefore customized the OS's system Python, and then at cluster creation time miniconda3 initialized its own Python, which overrides the OS's system Python.
I found a solution: in your custom image script, add the following code at the beginning; it will put you in the same Python environment as that of your jobs:
# This is /usr/bin/python
which python
# Activate miniconda3 optional component.
cat >>/etc/google-dataproc/dataproc.properties <<EOF
dataproc.components.activate=miniconda3
EOF
bash /usr/local/share/google/dataproc/bdutil/components/activate/miniconda3.sh
source /etc/profile.d/effective-python.sh
# Now this is /opt/conda/default/bin/python
which python
Then you can install packages, e.g.:
conda install <package> -y
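To check the result end to end, one rough sketch (the cluster name, region, image name, and test script below are placeholders, not anything from your setup) is to create a single-node cluster from the custom image and submit a small PySpark job that imports one of the installed packages:
# Create a single-node cluster from the custom image (names/region are placeholders).
gcloud dataproc clusters create my-test-cluster \
    --image=my-custom-image \
    --single-node \
    --region=us-central1
# check_imports.py is a hypothetical script that just does:
#   import numpy; print(numpy.__version__)
gcloud dataproc jobs submit pyspark check_imports.py \
    --cluster=my-test-cluster \
    --region=us-central1
If the job prints the package version instead of an ImportError, the packages were installed into the Python environment your jobs actually use.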