
How to install python packages in a Google Dataproc cluster

Is it possible to install python packages in a Google Dataproc cluster after the cluster is created and running?

I tried running "pip install xxxxxxx" on the master node's command line, but it does not seem to work.

Google's Dataproc documentation does not mention this situation.

Asked by Pablo Brenner, May 10 '18 19:05




1 Answer

This is generally not possible once the cluster has been created. I recommend using an initialization action, which runs on each node while the cluster is being created, to install the packages instead.

As you've noticed, pip is also not available by default, so you'll want to run easy_install pip first, followed by your pip install command.
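To make the steps above concrete, here is a minimal sketch, written in Python so it can be run and checked locally. It renders a bash initialization action that bootstraps pip and installs packages, and it builds the gcloud command that attaches the script when the cluster is created. The function names, the package list, the cluster name, and the bucket path are all hypothetical; the `command -v pip` guard is my own addition so that easy_install only runs when pip is actually missing.

```python
import shlex


def render_init_action(packages):
    """Return the text of a bash initialization action that installs `packages`."""
    if not packages:
        raise ValueError("at least one package is required")
    # Quote each requirement so unusual characters cannot break the shell line.
    quoted = " ".join(shlex.quote(p) for p in packages)
    lines = [
        "#!/bin/bash",
        "# Runs on every node while the cluster is being created.",
        "set -euo pipefail",
        "# pip is not available by default, so bootstrap it if it is missing.",
        "command -v pip >/dev/null 2>&1 || easy_install pip",
        "pip install " + quoted,
    ]
    return "\n".join(lines) + "\n"


def create_cluster_command(cluster_name, script_uri):
    """Build the gcloud argv that attaches the script at cluster creation."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--initialization-actions", script_uri,
    ]


if __name__ == "__main__":
    # Hypothetical package, cluster, and bucket names, for illustration only.
    print(render_init_action(["requests", "numpy==1.14.3"]))
    print(" ".join(create_cluster_command(
        "my-cluster", "gs://my-bucket/install-packages.sh")))
```

You would upload the rendered script to a GCS bucket you control and pass its gs:// path to the create command. Since the script runs unattended on every node, it is worth trying it on a small test cluster first.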

Finally, if you intend to use this cluster in any automation, or you want hermetic installs, I recommend building a wheel, storing it in GCS, and having the init action download and install it. Wheels have the added benefit of installing faster than pulling many packages from pip directly.
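The wheel approach can be sketched the same way. The Python helper below renders a bash initialization action that copies a prebuilt wheel from GCS and installs it from the local copy. The function name, bucket path, wheel filename, and /tmp download directory are hypothetical, and the sketch assumes gsutil is on the PATH of the cluster nodes, which you should verify for your image version.

```python
import posixpath
import shlex


def render_wheel_init_action(wheel_uri, download_dir="/tmp"):
    """Return a bash init action that downloads a wheel from GCS and installs it."""
    if not (wheel_uri.startswith("gs://") and wheel_uri.endswith(".whl")):
        raise ValueError("expected a gs:// URI ending in .whl")
    # Keep the original filename: pip reads the wheel's metadata from it.
    local_path = posixpath.join(download_dir, posixpath.basename(wheel_uri))
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "# Bootstrap pip if the image does not ship it.",
        "command -v pip >/dev/null 2>&1 || easy_install pip",
        "# Copy the prebuilt wheel from GCS, then install the local copy.",
        "gsutil cp {} {}".format(shlex.quote(wheel_uri), shlex.quote(local_path)),
        "pip install {}".format(shlex.quote(local_path)),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical bucket and wheel name, for illustration only.
    print(render_wheel_init_action(
        "gs://my-bucket/wheels/mypkg-1.0-py2.py3-none-any.whl"))
```

Note that pip may still fetch the wheel's own dependencies from the package index unless they are also provided as wheels, so a fully hermetic setup would need every dependency staged in GCS as well.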

2019 Update

See this tutorial on how to configure the Python environment on Dataproc: https://cloud.google.com/dataproc/docs/tutorials/python-configuration

Answered by tix, Sep 28 '22 01:09