Is there any way I can use preemptible instances for Dataflow jobs?

It's evident that preemptible instances are cheaper than non-preemptible instances. In my organization's project, 400-500 Dataflow jobs run daily; some are time-sensitive and others are not. Is there any way I could use preemptible instances for the non-time-sensitive jobs, so that the overall pipeline execution costs less? Currently I'm running Dataflow jobs with the configuration specified below.

        options.setTempLocation("gs://temp/");               // GCS path for temporary files
        options.setRunner(DataflowRunner.class);              // run on Cloud Dataflow
        options.setTemplateLocation("gs://temp-location/");   // where the template is staged
        options.setWorkerMachineType("n1-standard-4");        // worker VM type
        options.setMaxNumWorkers(20);                         // autoscaling upper bound
        options.setWorkerCacheMb(2000);                       // per-worker cache size in MB

I'm not able to find any pipeline option for a preemptible instance setting.

asked Feb 09 '20 by miles212

People also ask

When can preemptible instances be used?

Preemptible instances behave the same as regular compute instances, but the capacity is reclaimed when it's needed elsewhere, and the instances are terminated. If your workloads are fault-tolerant and can withstand interruptions, then preemptible instances can reduce your costs.

Why would you use preemptible VMs?

Preemptible instances use excess Compute Engine capacity, so their availability varies with usage. If your apps are fault-tolerant and can withstand possible instance preemptions, then preemptible instances can reduce your Compute Engine costs significantly.

What is the maximum life of a preemptible VM?

Preemptible VMs always stop after 24 hours. Preemptible VMs are recommended only for fault-tolerant applications that can withstand VM preemption.

Is it possible to share data across pipeline instances?

There is no Dataflow-specific cross-pipeline communication mechanism for sharing data or processing context between pipelines. You can use durable storage like Cloud Storage or an in-memory cache like App Engine to share data between pipeline instances.
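As a minimal sketch of the Cloud Storage approach with the Beam Java SDK (the gs://my-bucket paths are hypothetical), one pipeline writes its results to a shared location that a second pipeline reads later:

        import org.apache.beam.sdk.Pipeline;
        import org.apache.beam.sdk.io.TextIO;
        import org.apache.beam.sdk.options.PipelineOptionsFactory;

        // Pipeline A writes its results to a shared Cloud Storage location.
        Pipeline producer = Pipeline.create(PipelineOptionsFactory.create());
        producer.apply(TextIO.read().from("gs://my-bucket/input/*"))
                .apply(TextIO.write().to("gs://my-bucket/shared/results"));
        producer.run().waitUntilFinish();

        // Pipeline B, started after A has finished, reads the same location.
        Pipeline consumer = Pipeline.create(PipelineOptionsFactory.create());
        consumer.apply(TextIO.read().from("gs://my-bucket/shared/results*"));
        consumer.run().waitUntilFinish();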


1 Answer

Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:

  • Delayed execution: jobs are scheduled rather than executed right away (you'll see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources become available within a six-hour window, which makes FlexRS suitable for reducing the cost of non-time-critical workloads. Also, be sure to validate your code before submitting the job.
  • Batch jobs: as of now FlexRS only accepts batch jobs and requires autoscaling to be enabled:

    You cannot set autoscalingAlgorithm=NONE

  • Dataflow Shuffle: it needs to be enabled. With Shuffle on, no data is stored on persistent disks attached to the VMs, so when a preemption happens and resources are reclaimed there is no need to redistribute the data.
  • Regions: following from the previous point, only regions where Dataflow Shuffle is supported can be selected (list here; turn-up of new regions is announced in the release notes). As of now, the zone is chosen automatically within the region.
  • Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
  • SDK: requires 2.12.0 or newer for Java or Python.
  • Quota: quota is reserved upfront (i.e. queued jobs also consume quota).

To run such a job, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of the parameters conform to the FlexRS requirements above.
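As a minimal sketch of how the configuration from the question could request FlexRS programmatically (assuming Beam 2.12+, where DataflowPipelineOptions exposes setFlexRSGoal and the FlexResourceSchedulingGoal enum), note the worker machine type is switched to one FlexRS supports:

        import org.apache.beam.runners.dataflow.DataflowRunner;
        import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
        import org.apache.beam.sdk.options.PipelineOptionsFactory;

        DataflowPipelineOptions options =
                PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(DataflowRunner.class);
        options.setTempLocation("gs://temp/");
        // Request Flexible Resource Scheduling: the job is submitted as QUEUED
        // and started opportunistically within a six-hour window.
        options.setFlexRSGoal(
                DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
        // FlexRS currently supports only n1-standard-2 (default) and n1-highmem-16.
        options.setWorkerMachineType("n1-standard-2");
        options.setMaxNumWorkers(20); // autoscaling must stay enabled

For the time-sensitive jobs, simply leave the FlexRS goal unset and keep the current configuration.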

A uniform discount rate is applied to FlexRS jobs; you can compare pricing details in the Dataflow pricing documentation.

Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.

answered Sep 26 '22 by Guillem Xercavins