It's evident that preemptible instances are cheaper than non-preemptible instances. Around 400-500 Dataflow jobs run daily in my organisation's project. Some of these jobs are time-sensitive and others are not. Is there any way I could use preemptible instances for the non-time-sensitive jobs, so that overall pipeline execution costs less? Currently I'm running Dataflow jobs with the configuration below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I can't find any pipeline option for a preemptible instance setting.
Preemptible instances behave the same as regular compute instances, but the capacity is reclaimed when it's needed elsewhere, and the instances are terminated. If your workloads are fault-tolerant and can withstand interruptions, then preemptible instances can reduce your costs.
Preemptible instances use excess Compute Engine capacity, so their availability varies with usage. If your apps are fault-tolerant and can withstand possible instance preemptions, then preemptible instances can reduce your Compute Engine costs significantly.
Preemptible VMs always stop after 24 hours. Preemptible VMs are recommended only for fault-tolerant applications that can withstand VM preemption.
Is it possible to share data across pipeline instances? There is no Dataflow-specific cross pipeline communication mechanism for sharing data or processing context between pipelines. You can use durable storage like Cloud Storage or an in-memory cache like App Engine to share data between pipeline instances.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: FlexRS jobs are scheduled rather than started right away (you will see a QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable for reducing the cost of non-time-critical workloads. Also, be sure to validate your code before submitting the job.
Autoscaling: you cannot set autoscalingAlgorithm=NONE.
Machine types: n1-standard-2 (default) and n1-highmem-16.
To run a job with FlexRS, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of the parameters conform to the FlexRS requirements.
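Since only some of the jobs in the question are time-sensitive, one way to apply this is to choose the flags per job. The following is a minimal sketch in plain Java, not an official recipe: the class FlexRsArgs and the method buildArgs are hypothetical names, and the bucket, worker count and machine types are copied from the question and the list above as assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of any SDK): builds Dataflow
// command-line flags for a single job.
class FlexRsArgs {

    // Returns the flags for one job. FlexRS is requested only when the
    // job is NOT time-sensitive, because FlexRS jobs may sit in the
    // QUEUED state before they start.
    static List<String> buildArgs(boolean timeSensitive) {
        List<String> args = new ArrayList<>();
        args.add("--runner=DataflowRunner");
        args.add("--tempLocation=gs://temp/");   // bucket taken from the question
        args.add("--maxNumWorkers=20");          // value taken from the question
        if (timeSensitive) {
            // Regular job: keep the machine type from the question.
            args.add("--workerMachineType=n1-standard-4");
        } else {
            // Cost-optimized job: flag from the answer, plus one of the
            // machine types listed above. Autoscaling is left at its
            // default because autoscalingAlgorithm=NONE is not allowed.
            args.add("--flexRSGoal=COST_OPTIMIZED");
            args.add("--workerMachineType=n1-standard-2");
        }
        return args;
    }

    public static void main(String[] unused) {
        System.out.println(buildArgs(true));
        System.out.println(buildArgs(false));
    }
}
```

The resulting list can be converted to a String[] and passed to PipelineOptionsFactory.fromArgs(...) in the Beam Java SDK. If you prefer setters like the ones in the question, the SDK's DataflowPipelineOptions should expose an equivalent setFlexRSGoal option, but check this against the SDK version you use.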
A uniform discount rate is applied to FlexRS jobs; you can compare the details on the Dataflow pricing page.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.