 

Is there a way to reuse a single running Databricks cluster in multiple mapping data flows?

Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use that same running cluster in all of the data flows, instead of letting each data flow instance spin up its own cluster, which takes around 6 minutes per cluster?


1 Answer

Yes. Set the TTL on the Azure Integration Runtime under "Data Flow Properties" to an amount of time that covers the gap between data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
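For reference, here is a minimal sketch of setting that TTL programmatically with the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and IR names are placeholders, and the 15-minute TTL and 8-core General cluster are just example values:

```python
# Sketch: configure a TTL on an Azure Integration Runtime so its data flow
# cluster stays warm between jobs. All resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
)

# A managed IR whose data flow cluster lingers for 15 minutes after each
# job, so subsequent data flows can reuse the pooled VMs.
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=15,  # minutes; pick a value covering the gap between jobs
            ),
        )
    )
)

client.integration_runtimes.create_or_update(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    integration_runtime_name="DataFlowIR-TTL",
    integration_runtime=ir,
)
```

The same setting is available in the ADF UI on the integration runtime's "Data Flow Properties" tab.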

To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.

In ADF, you cannot access the underlying compute engine (Databricks in this case), so you have to kick off a dummy data flow to warm it up.
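As a rough sketch of that warm-up pattern (reusing the `client` from the snippet above; the data flow and pipeline names are hypothetical), the dummy data flow runs first and the real data flow depends on it, with both activities pinned to the same Azure IR:

```python
# Sketch: a pipeline that runs a small "warm-up" data flow first, then the
# real data flow on the same TTL-enabled Azure IR. Names are placeholders.
from azure.mgmt.datafactory.models import (
    PipelineResource,
    ExecuteDataFlowActivity,
    DataFlowReference,
    IntegrationRuntimeReference,
    ActivityDependency,
)

ir_ref = IntegrationRuntimeReference(reference_name="DataFlowIR-TTL")

warm_up = ExecuteDataFlowActivity(
    name="WarmUpCluster",
    data_flow=DataFlowReference(reference_name="DummyDataFlow"),
    integration_runtime=ir_ref,
)

main_flow = ExecuteDataFlowActivity(
    name="MainDataFlow",
    data_flow=DataFlowReference(reference_name="RealDataFlow"),
    integration_runtime=ir_ref,
    # Run only after the warm-up has brought the cluster online.
    depends_on=[ActivityDependency(activity="WarmUpCluster",
                                   dependency_conditions=["Succeeded"])],
)

pipeline = PipelineResource(activities=[warm_up, main_flow])
client.pipelines.create_or_update(
    "<resource-group>", "<data-factory>", "WarmStartPipeline", pipeline,
)
```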

That cluster start-up will take 5-6 minutes. But now, if you use that same Azure IR in your subsequent activities, as long as they are scheduled to execute within that TTL window, ADF can grab the existing VM resources to spin up the Spark clusters and marshal your data flow definition to the Spark job execution.

End-to-end, that process should now take just 2 minutes.


Mark Kromer MSFT


