How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me the trouble of logging into the master and/or worker nodes to install the libraries I need by hand.
It would also be great to know whether this automated installation can install things only on the master and not on the workers.
Initialization actions are the best way to do this. Initialization actions are shell scripts that run when the cluster is created. They let you customize the cluster, for example by installing Python libraries. These scripts must be stored in Google Cloud Storage and can be used when creating clusters via the Google Cloud SDK or the Google Developers Console.
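As a rough sketch of that workflow with the Google Cloud SDK, the script is first copied to a bucket and then passed to the cluster-creation command. The bucket, script, cluster, and region names below are placeholders I made up for illustration, and the `run` helper is only there so the commands can be previewed without touching a real project.

```shell
#!/bin/bash
# Hypothetical names: replace with your own bucket, script, cluster, and region.
BUCKET="gs://my-bucket"
SCRIPT="install-pandas.sh"
CLUSTER="my-cluster"
REGION="us-central1"

# With DRY_RUN=1 (the default here) each command is printed instead of executed.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Upload the initialization action to Google Cloud Storage.
run gsutil cp "${SCRIPT}" "${BUCKET}/${SCRIPT}"

# Create the cluster and point it at the uploaded script.
run gcloud dataproc clusters create "${CLUSTER}" \
  --region "${REGION}" \
  --initialization-actions "${BUCKET}/${SCRIPT}"
```

Setting DRY_RUN=0 runs the commands for real, which assumes gsutil and gcloud are installed and authenticated.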
Here is a sample initialization action that installs the Python pandas library at cluster creation, on the master node only:
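#!/bin/bash
# Read this node's role from the instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/role)
# Install pandas only on the master node ([[ ]] requires bash, not sh).
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install python-pandas -y
fi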
As you can see from this script, it is possible to discern the role of a node with /usr/share/google/get_metadata_value attributes/role and then perform actions specifically on the master (or worker) node.
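The same check can be extended to do something different on each kind of node. The following is a minimal sketch only: the `packages_for_role` function and the worker package choice are my own assumptions for illustration, and every role other than 'Master' is treated as a worker.

```shell
#!/bin/bash
# Map a node role to the package to install.
# The package choices are illustrative assumptions, not part of the answer above.
packages_for_role() {
  case "$1" in
    Master) echo "python-pandas" ;;
    *)      echo "python-numpy" ;;
  esac
}

HELPER=/usr/share/google/get_metadata_value
if [ -x "${HELPER}" ]; then
  # On a cluster node: look up the role and install the matching package.
  ROLE=$("${HELPER}" attributes/role)
  apt-get install -y "$(packages_for_role "${ROLE}")"
else
  # Off-cluster: the metadata helper is missing, so install nothing.
  echo "metadata helper not found; skipping install" >&2
fi
```

Keeping the role-to-package mapping in its own function makes it possible to check the branching logic on a local machine before uploading the script.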
See the Google Cloud Dataproc documentation for more details.