How to chain Azure Data Factory pipelines

I have a data factory with multiple pipelines, and each pipeline has around 20 copy activities to copy Azure tables between two storage accounts.

Each pipeline handles a snapshot of each Azure table, so I want to run the pipelines sequentially to avoid the risk of overwriting the latest data with old data.

I know this can be achieved by giving the first pipeline's output as input to the second pipeline. But as I have many activities in a pipeline, I am not sure which activity will complete last.

Is there any way to know that a pipeline has completed, or any way for one pipeline's completed status to trigger the next pipeline?

In an activity, inputs is an array. Is it possible to give multiple inputs? If so, will all inputs run asynchronously or one after the other?

In the context of multiple inputs I have read about scheduling dependency. Can an external input act as a scheduling dependency, or only an internal dataset?

asked May 25 '17 by Venky


2 Answers

This is an old one, but I was still having this issue with Data Factory v2, so in case anyone has come here looking for a solution for Data Factory v2: the "Wait on completion" tick box is hidden under the 'Advanced' part of the Settings tab for the Execute Pipeline activity. Just check it to get the desired result.

Note the 'Advanced' section on the Settings tab is not the same as the 'Advanced' free-coding tab. See screenshot:

[Screenshot: Execute Pipeline activity, Settings tab, Advanced section with the "Wait on completion" checkbox]
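If you are authoring in JSON rather than the UI, the same setting is just a property on the Execute Pipeline activity. A minimal sketch, assuming a child pipeline called CopyTablesPipeline (the names here are placeholders):

{
    "name": "RunCopyTablesPipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {
            "referenceName": "CopyTablesPipeline",
            "type": "PipelineReference"
        },
        "waitOnCompletion": true
    }
}

With waitOnCompletion set to true, the parent activity only reports success once the child pipeline has finished, so several Execute Pipeline activities can be chained one after another.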

answered Oct 02 '22 by Jim


I think you currently have a couple of options for dealing with this. Neither is really ideal, but nothing in ADF is ideal in its current form! So...

Option 1

Enforce a time slice delay or offset on the second pipeline's activities. A delay would be easier to change without re-provisioning slices and can be added at the activity level. This wouldn't be event driven, but would give you a little more control to avoid overlaps.

"policy": {
    "timeout": "1.00:00:00",
    "delay": "02:00:00",  // <<<< 2 hour delay
    "concurrency": 1,

Check this page for more info on both attributes and where to use them: https://docs.microsoft.com/en-us/azure/data-factory/data-factory-scheduling-and-execution
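The offset alternative sits on the dataset rather than on the activity policy. A minimal sketch of a daily dataset's availability section with a two-hour offset (the frequency and interval values are just examples):

"availability": {
    "frequency": "Day",
    "interval": 1,
    "offset": "02:00:00"
}

Unlike the activity delay, changing an offset affects how the dataset's slices are provisioned, which is why the delay is usually the easier knob to turn.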

Option 2

Break out the PowerShell and use something at a higher level to control this.

For example, use Get-AzureRmDataFactoryActivityWindow to check the first pipeline's state. Then, if complete, use Set-AzureRmDataFactorySliceStatus to update the second pipeline's datasets to Ready.
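A rough, untested sketch of what that could look like; the resource group, factory, pipeline, and dataset names are placeholders:

# Placeholder names - substitute your own
$rg      = "MyResourceGroup"
$factory = "MyDataFactory"

# Check the state of every activity window in the first pipeline
$windows = Get-AzureRmDataFactoryActivityWindow `
    -ResourceGroupName $rg `
    -DataFactoryName $factory `
    -PipelineName "Pipeline1"

# Only release the second pipeline once every window reports Ready
if (-not ($windows | Where-Object { $_.WindowState -ne "Ready" })) {
    Set-AzureRmDataFactorySliceStatus `
        -ResourceGroupName $rg `
        -DataFactoryName $factory `
        -DatasetName "Pipeline2InputDataset" `
        -StartDateTime (Get-Date).Date `
        -EndDateTime (Get-Date).Date.AddDays(1) `
        -Status "Ready"
}

You would need to run this on a schedule (for example from Azure Automation) since ADF v1 has no event trigger of its own.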

OR

Do this at the pipeline level with Suspend-AzureRmDataFactoryPipeline.

More info on ADF PowerShell cmdlets here: https://docs.microsoft.com/en-gb/powershell/module/azurerm.datafactories/Suspend-AzureRmDataFactoryPipeline?view=azurermps-4.0.0

As I say, neither option is ideal, and you've already mentioned dataset chaining in your question.

Hope this helps.

answered Oct 05 '22 by Paul Andrew