Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use that same running cluster in all of the data flows, instead of letting each data flow instance spin up its own cluster, which takes around 6 minutes per cluster?
Yes. Set the TTL in the Azure Integration Runtime under "Data Flow Properties" to a value that covers the gap between your data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
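For reference, this is roughly what the TTL setting looks like in the Azure IR's JSON definition. The IR name DataFlowsIR, the compute size, and the 30-minute TTL are example values, not anything prescribed; timeToLive is expressed in minutes:

```json
{
    "name": "DataFlowsIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "General",
                    "coreCount": 8,
                    "timeToLive": 30
                }
            }
        }
    }
}
```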
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engines (Databricks, in this case), so you have to kick off a dummy data flow to warm the cluster up.
That initial cluster start-up will take 5-6 minutes. But if your subsequent activities use that same Azure IR, and they are scheduled to execute within the TTL window, ADF can grab the existing VM resources to spin up the Spark cluster and marshal your data flow definition to the Spark job execution (see the pipeline sketch below).
End-to-end, that process should now take just 2 minutes.
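Putting it together, here is a minimal sketch of what such a pipeline's JSON could look like: a dummy data flow warms the cluster, then the real data flow runs on the same IR. The pipeline and data flow names (WarmStartPipeline, DummyDataFlow, MyActualDataFlow) are hypothetical, and DataFlowsIR is the Azure IR with the TTL configured above:

```json
{
    "name": "WarmStartPipeline",
    "properties": {
        "activities": [
            {
                "name": "WarmUpCluster",
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "DummyDataFlow",
                        "type": "DataFlowReference"
                    },
                    "integrationRuntime": {
                        "referenceName": "DataFlowsIR",
                        "type": "IntegrationRuntimeReference"
                    }
                }
            },
            {
                "name": "RunRealDataFlow",
                "type": "ExecuteDataFlow",
                "dependsOn": [
                    {
                        "activity": "WarmUpCluster",
                        "dependencyConditions": [ "Succeeded" ]
                    }
                ],
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "MyActualDataFlow",
                        "type": "DataFlowReference"
                    },
                    "integrationRuntime": {
                        "referenceName": "DataFlowsIR",
                        "type": "IntegrationRuntimeReference"
                    }
                }
            }
        ]
    }
}
```

If your real data flows live in separate pipelines triggered later, they simply need to reference the same Azure IR and execute within the TTL window to pick up the warm VM pool.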