I realize that with NiFi, as its documentation puts it, "continuous improvement occurs in production", so it doesn't lend itself to being used as a traditional development tool. However, for the project I'm working on it has been decided that this is the tool we'll be using, so I'd rather not debate its merits; I realize there are going to be some issues.
For example, if I push changes into an existing environment (say, from staging to production) and there were live edits in the destination, those edits will get overwritten. So I have questions about how to organize the development life cycle.
Apache NiFi helps to manage and automate the flow of data between systems. It can easily manage data transfer between source and destination systems, and it can be described as data logistics: NiFi moves and tracks data much like a parcel service moves and tracks packages.
The NiFi documentation is up to date and explains nicely how clusters work. To answer your question in short: if a node fails, the data that was on that node when it failed will require manual intervention to recover, and if you lose the storage on the failed node, you lose the data on that node.
Apache NiFi is a dataflow system based on the concepts of flow-based programming. It supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. NiFi has a web-based user interface for design, control, feedback, and monitoring of dataflows.
Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.
As the original author of the item you quoted and a member of the Apache NiFi PMC let me start by saying you're asking great questions and I can appreciate where you're coming from. We should probably improve the introduction document to better reflect the concerns you're raising.
You have it right that the current approach is to create templates of the flows, which you can then commit to version control. It is also the case that folks automate the deployment of those templates using scripts that interact with NiFi's REST API. But we can and should do far more than we have to make the dataflow management job easier, regardless of whether you're a developer writing precisely what will be deployed or an operations-focused person having to put these pieces together yourself.
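To make that REST-driven deployment pattern concrete, here is a minimal sketch in Python. It assumes a NiFi 1.x-style REST API (the `/nifi-api/templates/{id}/download` and `/nifi-api/process-groups/{id}/templates/upload` endpoints), an unsecured instance, and hypothetical host names and IDs; exact paths and authentication will differ by environment and NiFi version.

```python
import requests

# Hypothetical endpoints and IDs -- adjust for your environment and NiFi version.
STAGING = "http://staging-nifi:8080/nifi-api"
PROD = "http://prod-nifi:8080/nifi-api"
TEMPLATE_ID = "11111111-2222-3333-4444-555555555555"   # template to promote
PROD_PG = "root"                                        # target process group id

# 1. Export the template XML from staging (this is also what you would
#    commit to version control).
resp = requests.get(f"{STAGING}/templates/{TEMPLATE_ID}/download")
resp.raise_for_status()
template_xml = resp.content
with open("my_flow_template.xml", "wb") as f:
    f.write(template_xml)

# 2. Upload the same template XML into the production instance.
upload = requests.post(
    f"{PROD}/process-groups/{PROD_PG}/templates/upload",
    files={"template": ("my_flow_template.xml", template_xml, "application/xml")},
)
upload.raise_for_status()
print("Template uploaded to production")
```

Note that uploading only registers the template on the target instance; placing it onto the canvas is a separate call, which is where the concern about overwriting live edits comes into play.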
Elements of 1 and 2 will be present in the upcoming 1.0 release, and item 3 is fully covered in that release. In the meantime, for the multi-developer case I think it makes sense for each developer to treat their own local instance as a place for 'unit testing' their flow and then use a shared staging or production environment.

The key thing to keep in mind is that, for many flows and with NiFi's approach, it is OK to have multiple instances of a given flow template executing, each fed the live feed of data. The results/output of a flow can be wired to actually be delivered somewhere or simply be grounded. In this way it is a lot like the mental model of branching in source control such as Git: you get to choose which instance you consider 'production' and which flow on the graph is simply an ongoing feature branch, if you will. For people coming from a more traditional approach this is not obvious, and we need to do more to describe and demonstrate it (see the sketch after the links below). However, we should also support more traditional approaches, and that is what some of the feature proposals I've linked to will enable.
[1] https://cwiki.apache.org/confluence/display/NIFI/Configuration+Management+of+Flows
[2] https://cwiki.apache.org/confluence/display/NIFI/Extension+Registry
[3] https://cwiki.apache.org/confluence/display/NIFI/Variable+Registry
[4] https://cwiki.apache.org/confluence/display/NIFI/Multi-Tentant+Dataflow
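As an illustration of the "multiple instances as branches" idea, here is a minimal sketch, again assuming a NiFi 1.x-style REST API and the hypothetical IDs from the previous example. It instantiates the same uploaded template twice onto the canvas; which copy you treat as 'production' and which as a 'feature branch' is then just a matter of how you wire (or ground) each copy's output.

```python
import requests

# Hypothetical values -- adjust for your environment and NiFi version.
NIFI = "http://prod-nifi:8080/nifi-api"
TEMPLATE_ID = "11111111-2222-3333-4444-555555555555"  # template uploaded earlier
PARENT_PG = "root"                                     # process group to place the copies in

def instantiate(origin_x, origin_y):
    """Place one copy of the template onto the canvas at the given origin."""
    resp = requests.post(
        f"{NIFI}/process-groups/{PARENT_PG}/template-instance",
        json={"templateId": TEMPLATE_ID, "originX": origin_x, "originY": origin_y},
    )
    resp.raise_for_status()
    return resp.json()

# Two side-by-side copies of the same flow; both can be fed the live data.
instantiate(0.0, 0.0)    # treat as 'production': wire its output to the real destination
instantiate(0.0, 600.0)  # treat as a 'feature branch': ground its output while iterating
```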