
Triggering Databricks job from Airflow without starting new cluster

I am using Airflow to trigger jobs on Databricks. I have many DAGs running Databricks jobs, and I wish to use only one cluster instead of many, since to my understanding this will reduce the costs these tasks generate.

Using DatabricksSubmitRunOperator there are two ways to run a job on Databricks. Either on an existing, running cluster, referencing it by id

'existing_cluster_id' : '1234-567890-word123',

or starting a new cluster

'new_cluster': {
    'spark_version': '2.1.0-db3-scala2.11',
    'num_workers': 2
  },
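For reference, a minimal sketch of how I wire either fragment into a DAG (the notebook path is made up, and the import path assumes the current Databricks provider package; older Airflow releases use airflow.contrib.operators.databricks_operator):

from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG('databricks_example', start_date=datetime(2019, 2, 1), schedule_interval=None) as dag:
    notebook_task = {'notebook_path': '/Users/me/my_notebook'}  # made-up path

    # Variant 1: run on an already-running cluster, addressed by its id.
    run_on_existing = DatabricksSubmitRunOperator(
        task_id='run_on_existing',
        existing_cluster_id='1234-567890-word123',
        notebook_task=notebook_task,
    )

    # Variant 2: spin up a fresh cluster just for this run.
    run_on_new = DatabricksSubmitRunOperator(
        task_id='run_on_new',
        new_cluster={'spark_version': '2.1.0-db3-scala2.11', 'num_workers': 2},
        notebook_task=notebook_task,
    )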

Now I would like to avoid starting a new cluster for each task; however, the existing cluster shuts down during downtime, so it is no longer reachable through its id and I get an error. The only option I see, then, is a new cluster.

1) Is there a way to have a cluster being callable by id even when it is down?

2) Do people simply keep the clusters alive?

3) Or am I completely wrong and starting clusters for each task won't generate more costs?

4) Is there something I missed completely?

Asked Feb 06 '19 by Yannick Widmer

2 Answers

Updates based on @YannickSSE's comment response
I don't use Databricks. Can you start a new cluster with the same id as the cluster you may or may not expect to be running, and have it be a no-op in the case that it is already running? Maybe not, or you probably wouldn't be asking this. (Response: no, when starting a new cluster you cannot give it an id.)

Could you write a Python or bash operator which tests for the existence of the cluster? (Response: this would be a test job submission… not the best approach.) If it finds it and succeeds, the downstream task would trigger your job with the existing cluster id; if it doesn't, another downstream task could use the trigger_rule all_failed to run the same task but with a new cluster. Both of those DatabricksSubmitRunOperator tasks could then have one downstream task with the trigger_rule one_success. (Response: or use a branching operator to determine which operator is executed; see the sketch below.)

It might not be ideal, because I imagine your cluster id then changes from time to time, forcing you to keep up with it. … Is the cluster part of the Databricks hook's connection for that operator, and something that can be updated? Maybe you want to specify it in the tasks that need it as {{ var.value.<identifying>_cluster_id }} and keep it updated as an Airflow Variable. (Response: the cluster id is not in the hook, so the variable or the DAG file would have to be updated whenever it changes.)
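Here is a rough, untested sketch of that branching variant. It assumes the cluster id lives in an Airflow Variable named shared_cluster_id, that the Databricks host and token are exposed as environment variables for the existence check, and that the notebook path is made up; on older Airflow versions EmptyOperator is DummyOperator and the Databricks operator lives under airflow.contrib.

import os
from datetime import datetime

import requests
from airflow import DAG
from airflow.models import Variable
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import BranchPythonOperator
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator


def choose_cluster():
    # Ask the Databricks Clusters API whether the saved cluster id is currently running.
    # DATABRICKS_HOST is e.g. https://<workspace>.cloud.databricks.com (placeholder names).
    cluster_id = Variable.get('shared_cluster_id', default_var=None)
    if cluster_id:
        resp = requests.get(
            os.environ['DATABRICKS_HOST'] + '/api/2.0/clusters/get',
            headers={'Authorization': 'Bearer ' + os.environ['DATABRICKS_TOKEN']},
            params={'cluster_id': cluster_id},
        )
        if resp.ok and resp.json().get('state') == 'RUNNING':
            return 'run_on_existing_cluster'
    return 'run_on_new_cluster'


with DAG('databricks_shared_cluster', start_date=datetime(2019, 2, 1), schedule_interval=None) as dag:
    notebook_task = {'notebook_path': '/Users/me/my_notebook'}  # made-up path

    branch = BranchPythonOperator(task_id='choose_cluster', python_callable=choose_cluster)

    run_on_existing = DatabricksSubmitRunOperator(
        task_id='run_on_existing_cluster',
        existing_cluster_id='{{ var.value.shared_cluster_id }}',
        notebook_task=notebook_task,
    )

    run_on_new = DatabricksSubmitRunOperator(
        task_id='run_on_new_cluster',
        new_cluster={'spark_version': '2.1.0-db3-scala2.11', 'num_workers': 2},
        notebook_task=notebook_task,
    )

    # Fires once whichever branch actually ran has succeeded; the skipped branch is ignored.
    join = EmptyOperator(task_id='join', trigger_rule='one_success')

    branch >> [run_on_existing, run_on_new] >> join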

Answered Sep 29 '22 by dlamblin

It seems Databricks has recently added an option to reuse a job cluster within a job, sharing it between that job's tasks.

https://databricks.com/blog/2022/02/04/saving-time-and-costs-with-cluster-reuse-in-databricks-jobs.html

Until now, each task had its own cluster to accommodate for the different types of workloads. While this flexibility allows for fine-grained configuration, it can also introduce a time and cost overhead for cluster startup or underutilization during parallel tasks.

In order to maintain this flexibility, but further improve utilization, we are excited to announce cluster reuse. By sharing job clusters over multiple tasks customers can reduce the time a job takes, reduce costs by eliminating overhead and increase cluster utilization with parallel tasks.

This seems to be available in the new API as well. https://docs.databricks.com/dev-tools/api/latest/jobs.html#operation/JobsCreate

job_clusters Array of objects (JobCluster) <= 100 items

A list of job cluster specifications that can be shared and reused by tasks of this job. Libraries cannot be declared in a shared job cluster. You must declare dependent libraries in task settings.

To fit your use case, you could start a new job cluster with your job, share it between your tasks, and it will shut down automatically at the end (a rough sketch of such a job definition follows below).
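As a rough illustration (not taken from the blog post), a Jobs 2.1 create payload with a shared job cluster might look like the following; the job name, notebook paths, cluster spec and token handling are all placeholders:

import os

import requests

# Two tasks share the job cluster identified by the key 'shared_cluster'.
payload = {
    'name': 'example_shared_cluster_job',
    'job_clusters': [
        {
            'job_cluster_key': 'shared_cluster',
            'new_cluster': {'spark_version': '10.4.x-scala2.12', 'num_workers': 2},
        }
    ],
    'tasks': [
        {
            'task_key': 'ingest',
            'job_cluster_key': 'shared_cluster',
            'notebook_task': {'notebook_path': '/Jobs/ingest'},
        },
        {
            'task_key': 'transform',
            'depends_on': [{'task_key': 'ingest'}],
            'job_cluster_key': 'shared_cluster',
            'notebook_task': {'notebook_path': '/Jobs/transform'},
        },
    ],
}

resp = requests.post(
    os.environ['DATABRICKS_HOST'] + '/api/2.1/jobs/create',
    headers={'Authorization': 'Bearer ' + os.environ['DATABRICKS_TOKEN']},
    json=payload,
)
print(resp.json())  # returns the new job_id on success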

I still don't fully understand how we might keep a job cluster hot all the time if we want to have jobs start with no latency. I also don't think it's possible to share these clusters between jobs.

For now this information should provide a decent lead.

Answered Sep 29 '22 by WarSame