How do I install Python libraries automatically on Dataproc cluster startup?

How can I automatically install Python libraries on my Dataproc cluster when the cluster starts? This would save me from logging into the master and/or worker nodes to install the libraries I need by hand.

It would also be helpful to know whether this automated installation can be limited to the master node, leaving the workers untouched.

James asked Sep 23 '15 17:09


1 Answer

Initialization actions are the best way to do this. They are shell scripts that run on each node when the cluster is created, which lets you customize the cluster, for example by installing Python libraries. The scripts must be stored in Google Cloud Storage and can be specified when creating a cluster through the Google Cloud SDK or the Google Developers Console.
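As a rough sketch of the Cloud SDK route, the snippet below prints a cluster creation command that points at an initialization action. The cluster name, bucket, script name, and the create_cmd helper are placeholders chosen for illustration, not values from the answer; the --initialization-actions flag is the one gcloud uses to attach such scripts.

```shell
#!/bin/sh
# Placeholder values (assumptions, not from the original answer).
CLUSTER_NAME="my-cluster"
SCRIPT_URI="gs://my-bucket/install-pandas.sh"

# Hypothetical helper: print the create command so it can be reviewed
# before it is run against a real project.
create_cmd() {
  printf 'gcloud dataproc clusters create %s --initialization-actions %s\n' "$1" "$2"
}

create_cmd "${CLUSTER_NAME}" "${SCRIPT_URI}"
```

The script must already be uploaded to the bucket (for example with gsutil cp) before the printed command is run. Depending on your gcloud version and configuration, additional flags such as --region may also be needed.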

Here is a sample initialization action that installs the Python pandas library at cluster creation, on the master node only.

#!/bin/bash
# Look up this node's role from the instance metadata.
ROLE=$(/usr/share/google/get_metadata_value attributes/role)
# Install pandas only on the master node.
if [[ "${ROLE}" == 'Master' ]]; then
  apt-get install -y python-pandas
fi

As the script shows, you can determine a node's role with /usr/share/google/get_metadata_value attributes/role and then perform actions only on the master (or only on the worker) nodes.
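The same role check can be extended to install different packages on masters and workers. This is only a sketch: the packages_for_role helper and the worker package (python-numpy) are made up for illustration, and the install step is skipped when the metadata helper is not present, so the script does nothing outside a Dataproc node.

```shell
#!/bin/sh
# Hypothetical helper: map a node role to the packages it should get.
# The worker package below is an illustrative assumption.
packages_for_role() {
  if [ "$1" = "Master" ]; then
    echo "python-pandas"
  else
    echo "python-numpy"
  fi
}

HELPER=/usr/share/google/get_metadata_value
# Only install when the metadata helper exists, i.e. on a Dataproc node.
if [ -x "${HELPER}" ]; then
  ROLE=$("${HELPER}" attributes/role)
  # Left unquoted on purpose so several package names split into arguments.
  apt-get install -y $(packages_for_role "${ROLE}")
fi
```

Keeping the role-to-package mapping in one function makes it easy to check locally, for example by running packages_for_role Master, before uploading the script to Cloud Storage.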

See the Google Cloud Dataproc documentation for more details.

James answered Oct 02 '22 01:10