Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pausing Dataproc cluster - Google Compute engine

is there a way of pausing a Dataproc cluster so I don't get billed when I am not actively running spark-shell or spark-submit jobs ? The cluster management instructions at this link: https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/

only show how to destroy a cluster but I have installed spark cassandra connector API for example. Is my only alternative to just creating an image that I'll need to install every time ?

like image 861
femibyte Avatar asked Jan 01 '16 17:01

femibyte


1 Answers

In general, the best thing to do is to distill out the steps you used to customize your cluster into some setup scripts, and then use Dataproc's initialization actions to easily automate doing the installation during cluster deployment.

This way, you can easily reproduce the customizations without requiring manual involvement if you ever want, for example, to do the same setup on multiple concurrent Dataproc clusters, or want to change machine types, or receive sub-minor-version bug fixes that Dataproc releases occasionally.

There's indeed no officially supported way of pausing a Dataproc cluster at the moment, in large part simply because being able to have reproducible cluster deployments along with several other considerations listed below means that 99% of the time it's better to use initialization-action customizations instead of pausing a cluster in-place. That said, there are possible short-term hacks, such as going into the Google Compute Engine page, selecting the instances that are part of the Dataproc cluster you want to pause, and clicking "stop" without deleting them.

The Compute Engine hourly charges and Dataproc's per-vCPU charges are only incurred when the underlying instance is running, so while you've "stopped" the instances manually, you won't incur Dataproc or Compute Engine's instance-hour charges despite Dataproc still listing the cluster as "RUNNING", albeit with warnings that you'll see if you go to the "VM Instances" tab of the Dataproc cluster summary page.

You should then be able to just click "start" from the Google Compute Engine page page to have the cluster running again, but it's important to consider the following caveats:

  1. The cluster may occasionally fail to start up into a healthy state again; anything using local SSDs already can't be stopped and started again cleanly, but beyond that, Hadoop daemons may have failed for whatever reason to flush something important to disk if the shutdown wasn't orderly, or even user-installed settings may have broken the startup process in unknown ways.
  2. Even when VMs are "stopped", they depend on the underlying Persistent Disks remaining, so you'll continue to incur charges for those even while "paused"; if we assume $0.04 per GB-month, and a default 500GB disk per Dataproc node, that comes out to continuing to pay ~$0.028/hour per instance; generally your data will be more accessible and also cheaper to just put in Google Cloud Storage for long term storage rather than trying to keep it long-term on the Dataproc cluster's HDFS.
  3. If you come to depend on a manual cluster setup too much, then it'll become much more difficult to re-do if you need to size up your cluster, or change machine types, or change zones, etc. In contrast, with Dataproc's initialization actions, you can use Dataproc's cluster scaling feature to resize your cluster and automatically run the initialization actions for new workers created.

Update

Dataproc recently launched the ability to stop and start clusters: https://cloud.google.com/dataproc/docs/guides/dataproc-start-stop

like image 187
Dennis Huo Avatar answered Sep 21 '22 07:09

Dennis Huo