 

Multiple Azure data factories vs one factory

I'm building an Azure data lake using data factory at the moment, and am after some advice on having multiple data factories vs just one.

I have one data factory at the moment that is sourcing data from one EBS instance, for one specific company under an enterprise. In the future, though, there might be other EBS instances and other companies (with other applications as sources) to incorporate into the factory, and I'm thinking the diagram might get a bit messy.

I've searched around and found the site below, which recommends keeping everything in a single data factory to reuse linked services. I guess that is a good thing; however, as I have scripted the build for one data factory, it would be pretty easy to build the linked services again to point at the same data lake, for instance (a rough sketch of what that might look like follows the link).

https://www.purplefrogsystems.com/paul/2017/08/chaining-azure-data-factory-activities-and-datasets/
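
To illustrate the point, re-creating the same data lake linked service in a second factory would only be a few calls, roughly along these lines. This is a sketch using the azure-mgmt-datafactory Python SDK, not my actual build script, and the subscription, resource group, factory and storage names are all placeholders:

    # A sketch, not the actual build script: re-creating the same data lake
    # linked service in a second data factory with the azure-mgmt-datafactory
    # Python SDK. Subscription, resource group, factory and storage names are
    # all placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource,
        AzureBlobFSLinkedService,  # ADLS Gen2 linked service type
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # One linked service definition, pointed at the shared data lake.
    data_lake_ls = LinkedServiceResource(
        properties=AzureBlobFSLinkedService(url="https://<datalake>.dfs.core.windows.net")
    )

    # Deploy the identical linked service into each factory that needs it.
    for factory_name in ["adf-company-a", "adf-company-b"]:
        adf_client.linked_services.create_or_update(
            "<resource-group>", factory_name, "LS_DataLake", data_lake_ls
        )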

Pros for having only one instance of data factory:

  • Only have to create the datasets and linked services once
  • Can see overall lineage in one diagram

Cons

  • Could get messy over time
  • Could get so big that it's hard to even find the pipeline you are after

Has anyone got some large deployments of Azure Data Factory out there that bring in potentially thousands of data sources, mix them together and transform them? I'd be interested in hearing your thoughts.

asked Jan 11 '18 by mitroberts

2 Answers

My suggestion is to have only one, as it makes it easier to configure multiple integration runtimes (gateways). If you decide to have more than one data factory, take into consideration that a PC can only have one integration runtime installed, and that the integration runtime can be registered to only one data factory instance.

I think the cons you are listing are both fixed by having naming rules. It's not messy to find the pipeline you want if you name them like Pipeline_[Database name][db schema][table name], for example.
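
For instance, a tiny helper to generate those names consistently could look like this (illustrative only; it joins the segments with underscores for readability):

    # Illustrative only: generate predictable pipeline names following the
    # Pipeline_[Database name][db schema][table name] convention above
    # (underscores added between segments for readability).
    def pipeline_name(database: str, schema: str, table: str) -> str:
        return f"Pipeline_{database}_{schema}_{table}"

    print(pipeline_name("SalesDb", "dbo", "Customers"))  # Pipeline_SalesDb_dbo_Customers
    print(pipeline_name("EBS", "apps", "GL_BALANCES"))   # Pipeline_EBS_apps_GL_BALANCES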

I have a project with thousands of datasets and pipelines, and it's not harder to handle than smaller projects.

Hope this helped!

answered by Martin Esteban Zurita

I'd initially agree that an integration runtime being tied to a single data factory is a restriction; however, I suspect it is no longer, or will soon no longer be, a restriction.

In the March 13th update to AzureRm.DataFactories, there is a comment stating "Enable integration runtime to be shared across data factory".
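
If that feature applies to your setup, linking the shared runtime into a second factory might look roughly like the sketch below. This assumes the azure-mgmt-datafactory Python SDK rather than the PowerShell module mentioned above; the names and resource IDs are placeholders, and the second factory would first need to be granted access to the shared runtime:

    # Rough sketch: register a *linked* self-hosted integration runtime in a
    # second factory, pointing at the runtime hosted in a shared factory.
    # Assumes the azure-mgmt-datafactory Python SDK; names and IDs are placeholders.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        IntegrationRuntimeResource,
        SelfHostedIntegrationRuntime,
        LinkedIntegrationRuntimeRbacAuthorization,
    )

    client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Full resource ID of the runtime in the factory that actually hosts it.
    shared_ir_id = (
        "/subscriptions/<subscription-id>/resourceGroups/<rg>"
        "/providers/Microsoft.DataFactory/factories/adf-shared"
        "/integrationRuntimes/SelfHostedIR"
    )

    linked_ir = IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(
            linked_info=LinkedIntegrationRuntimeRbacAuthorization(resource_id=shared_ir_id)
        )
    )
    client.integration_runtimes.create_or_update(
        "<rg>", "adf-company-b", "SelfHostedIR-linked", linked_ir
    )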

I think it will depend on the complexity of the data factory and whether there are inter-dependencies between the various sources and destinations.

The UI in particular (even more so in V2) makes managing a large data factory easy.

However, if you choose an ARM deployment technique, the data factory JSON can soon become unwieldy in even a modestly complex data factory. In that sense I'd recommend splitting them.

You can of course mitigate maintainability issues, as people have mentioned, by breaking your ARM templates into nested deployments, using ARM parameterisation or Data Factory V2 parameterisation, or using the SDK directly with separate files. Or even just use the UI (now with git support :-) ).

Perhaps more importantly, particularly as you mention separate companies being sourced from: it sounds like the data isn't related, and if it isn't, should it be isolated to avoid any coding errors? Or perhaps even to have segregated roles and responsibilities for the data factories.

On the other hand, if the data is interrelated, having it in one data factory makes things far easier, allowing Data Factory to manage data dependencies and re-run failed slices in one go.

answered by Alex KeySmith