
What are the differences between Data Lineage and Data Provenance?

From wiki,

Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.

Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins.

It seems that both concepts are about where the data comes from, but I'm still confused about the differences. Are the two concepts the same? If they are different, can someone share an example?

Thanks,

CSY asked Apr 13 '17 03:04



People also ask

What is data provenance?

Data provenance is the documentation of where a piece of data comes from and the processes and methodology by which it was produced. Put simply, provenance answers the questions of why and how the data was produced, where, when and by whom.

What are the different types of data lineage?

There are two different types of data lineage — business lineage and technical lineage. Rudimentary data lineage solutions only have business lineage; more advanced data lineage tools have both business and technical lineage. Business lineage provides only a summary view.

What is data lineage with example?

It involves evaluation of metadata for tables, columns, and business reports. Using this metadata, it investigates lineage by looking for patterns. For example, if two datasets contain a column with a similar name and very similar data values, it is very likely that this is the same data in two stages of its lifecycle.

Is data lineage part of data governance?

Data governance refers to the rules and processes imposed on maintaining data in a company. Data lineage is the part of data governance that records the movement of data from its original source through any system in between that source and the data's destination.


2 Answers

From our experience, data provenance includes only a high-level view of the system for business users, so they can roughly navigate where their data comes from. It is provided by a variety of modeling tools or just simple custom tables and charts. Data lineage is a more specific term and has two sides: business (data) lineage and technical (data) lineage. Business lineage pictures data flows at the business-term level and is provided by solutions like Collibra, Alation and many others. Technical data lineage is created from actual technical metadata and tracks data flows at the lowest level: actual tables, scripts and statements. It is provided by solutions such as MANTA or Informatica Metadata Manager.
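To make "technical lineage at the level of tables, scripts and statements" a bit more concrete, here is a minimal sketch of lineage stored as a graph of edges, with a helper that traces everything upstream of a report table. The table names, script names and the `upstream` helper are all hypothetical, made up for illustration; they are not taken from any of the tools mentioned above.

```python
from collections import defaultdict

# Hypothetical table-level lineage edges:
# (source table, target table, script/statement that moved the data).
EDGES = [
    ("crm.customers", "staging.customers", "load_customers.sql"),
    ("erp.orders", "staging.orders", "load_orders.sql"),
    ("staging.customers", "dw.sales_report", "build_report.sql"),
    ("staging.orders", "dw.sales_report", "build_report.sql"),
]


def upstream(target, edges):
    """Return every table that feeds `target`, directly or indirectly."""
    parents = defaultdict(set)
    for src, dst, _script in edges:
        parents[dst].add(src)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(upstream("dw.sales_report", EDGES)))
# ['crm.customers', 'erp.orders', 'staging.customers', 'staging.orders']
```

A business-lineage view would presumably collapse the same graph to something like "CRM and ERP feed the sales report", hiding the staging tables and scripts.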

Jan Andrs answered Oct 11 '22 03:10



Data provenance is:

data lineage (its genealogy and the history of its journey: where it began, how it came into being, how it changed over time, where it has been, which systems it has traveled through, and any loss or gain), i.e. data-oriented metadata

PLUS

the inputs, entities, systems and processes that influenced the data (i.e. process-oriented), which can be used to reproduce the data.
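Since the question asked for an example, here is a rough sketch of the "lineage PLUS process" idea as two record types. Every class name, field name and value below is an assumption chosen for illustration, not a standard schema.

```python
from dataclasses import dataclass


@dataclass
class Lineage:
    """Data-oriented: where the dataset began and which systems it passed through."""
    dataset: str
    sources: list
    hops: list


@dataclass
class Provenance:
    """Lineage plus the process-oriented details needed to reproduce the data."""
    lineage: Lineage
    process: str      # hypothetical script that produced the dataset
    parameters: dict  # inputs the process was run with
    agent: str        # who or what ran it
    executed_at: str  # when it ran (ISO 8601 string)

    def rerun_recipe(self):
        """Describe how the dataset could be regenerated from its sources."""
        args = " ".join(f"--{k}={v}" for k, v in sorted(self.parameters.items()))
        return f"{self.process} {args} < {', '.join(self.lineage.sources)}"


lineage = Lineage(
    dataset="dw.sales_report",
    sources=["crm.customers", "erp.orders"],
    hops=["staging.customers", "staging.orders"],
)
provenance = Provenance(
    lineage=lineage,
    process="build_report.py",
    parameters={"year": 2017, "currency": "USD"},
    agent="etl-service",
    executed_at="2017-04-13T03:04:00Z",
)

print(provenance.rerun_recipe())
# build_report.py --currency=USD --year=2017 < crm.customers, erp.orders
```

The `Lineage` record alone says where the data has been; only the `Provenance` record carries enough about the process to rerun it.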

Sam M answered Oct 11 '22 03:10
