Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use that same running cluster in all of the data flows, instead of letting each data flow instance spin up its own cluster, which takes around 6 minutes per cluster?
Yes. Set the TTL in the Azure Integration Runtime under "Data Flow Properties" to a value that covers the gap between your data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
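For reference, this is roughly what the TTL setting looks like in the Azure IR's JSON definition. The IR name DataFlowsIR, the compute size, and the 30-minute TTL are example values, not anything prescribed; timeToLive is expressed in minutes:

```json
{
    "name": "DataFlowsIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 30
                }
            }
        }
    }
}
```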
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engines (Databricks, in this case), so you have to kick off a dummy data flow to warm the cluster up.
That initial cluster start-up will take 5-6 minutes. But if your subsequent activities use that same Azure IR, and they are scheduled to execute within the TTL window, ADF can grab the existing VM resources to spin up the Spark cluster and marshal your data flow definition to the Spark job execution (see the pipeline sketch below).
End-to-end, that process should now take just 2 minutes.
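Putting it together, here is a minimal sketch of what such a pipeline's JSON could look like: a dummy data flow warms the cluster, then the real data flow runs on the same IR. The pipeline and data flow names (WarmStartPipeline, DummyDataFlow, MyActualDataFlow) are hypothetical, and DataFlowsIR is the Azure IR with the TTL configured above:

```json
{
    "name": "WarmStartPipeline",
    "properties": {
        "activities": [
            {
                "name": "WarmUpCluster",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "DummyDataFlow",
                        "type": "DataFlowReference"
                    },
                    "integrationRuntime": {
                        "referenceName": "DataFlowsIR",
                        "type": "IntegrationRuntimeReference"
                    }
                }
            },
            {
                "name": "RunRealDataFlow",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {
                        "activity": "WarmUpCluster",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyActualDataFlow",
                        "type": "DataFlowReference"
                    },
                    "integrationRuntime": {
                        "referenceName": "DataFlowsIR",
                        "type": "IntegrationRuntimeReference"
                    }
                }
            }
        ]
    }
}
```

If your real data flows live in separate pipelines triggered later, they simply need to reference the same Azure IR and execute within the TTL window to pick up the warm VM pool.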