Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the purpose of data provenance in Apache NiFi Processors

Tags:

apache-nifi

For every processor there is a way to configure the processor and there is a context menu to view data provenance.

Is there a good explanation of what is data provenance?

NiFi screenshot

like image 246
Santanu Dey Avatar asked Aug 15 '16 02:08

Santanu Dey


People also ask

Why is data provenance?

Data provenance is important for tracking down errors within data and attributing them to sources. Additionally, data provenance can be useful in reporting and auditing for business and research processes.

What is the primary purpose of Apache NiFi?

Apache NiFi is an integrated data logistics platform for automating the movement of data between disparate systems. It provides real-time control that makes it easy to manage the movement of data between any source and any destination.

How does NiFi manages the data flow between the processors?

The core concepts of NiFi A FlowFile represents each object moving through the system and for each one, NiFi keeps track of a map of key/value pair attribute strings and its associated content of zero or more bytes. Processors actually perform the work.


1 Answers

Data provenance is all about understanding the origin and attribution of data. In a typical system you get 'logs'. When you consider data flowing through a series of processes and queues you end up with a lot of lots of course. If you want to follow the path a given piece of data took, or how long it took to take that path, or what happened to an object that got split up into different objects and so on all of that is really time consuming and tough. The provenance that NiFi supports is like logging on steroids and is all about keeping and tracking these relationships between data and the events that shaped and impacted what happened to it. NiFi is keeping track of where each piece of data comes from, what it learned about the data, maintains the trail across splits, joins, transformations, where it sends it, and ultimately when it drops the data. Think of it like a chain of custody for data.

This is really valuable for a few reasons. First, understanding and debugging. Having this provenance capture means from a given event you can go forwards or backwards in the flow to see where data came from and went. Given that NiFi also has an immutable versioned content store under the covers you can also use this to click directly to the content at each stage of the flow. You can also replay the content and context of a given event against the latest flow. This in turn means much faster iteration to the configuration and results you want. This provenance model is also valuable for compliance reasons. You can prove whether you sent data to the correct systems or not. If you learn that you didn't then have data with which you can address the issue or create a powerful audit trail for follow-up.

The provenance model in Apache NiFi is really powerful and it is being extended to the Apache MiNiFi which is a subproject of Apache NiFi as well. More systems producing more provenance will mean you have a far stronger ability to track data from end-to-end. Of course this becomes even more powerful when it can be combined with other lineage systems or centralized lineage stores. Apache Atlas may be a great system to integrate with for this to bring a centralized view. NiFi is able to not only do what I described above but to also send these events to such a central store. So, exciting times ahead for this.

Hope that helps.

like image 133
Joe Witt Avatar answered Sep 27 '22 19:09

Joe Witt