
Data stored in MLMD in TensorFlow TFX

As far as I understand, TFX uses MLMD (ML Metadata) to record and retrieve metadata associated with workflows. This may include:

  1. results of pipeline components
  2. metadata about artifacts generated through the components of the pipelines
  3. metadata about executions of these components
  4. metadata about the pipeline and associated lineage information

Features:

Does the above (e.g. #1, "results of pipeline components") imply that MLMD stores actual data, such as the input features for ML training? If not, what is meant by "results of pipeline components"?

Orchestration and pipeline history:

Also, when using TFX with e.g. Airflow, which has its own metastore (metadata about DAGs and their runs, plus other Airflow configuration such as users, roles, and connections), does MLMD store redundant information? Does it supersede the Airflow metastore?

Josh asked Jul 06 '20 20:07


2 Answers

TFX is an ML pipeline/workflow framework. When you write a TFX application, you are essentially defining the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformations, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, to check for anomalies it needs to remember the previous data schema and statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the metadata of the run itself.

Airflow also saves run metadata. That can be seen as a subset of all the metadata, and a very limited one compared to what is saved in MLMD, so there is some redundancy. TFX is the controller: it defines and makes use of the underlying Airflow orchestration. MLMD will not supersede the Airflow metastore, but the pipeline will definitely fail if there is a clash.
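
To make the anomaly-check point concrete, here is a minimal sketch in plain Python. It is not the TFX or TensorFlow Data Validation API; the helpers `summarize` and `find_anomalies`, the feature names, and the simple min/max rule are all assumptions made for illustration. The point it tries to show is that only a small summary of the previous data has to be remembered, not the rows themselves.

```python
from typing import Dict, List, Tuple

Batch = List[Dict[str, float]]
Summary = Dict[str, Tuple[float, float]]  # feature name -> (min, max)


def summarize(batch: Batch) -> Summary:
    """Reduce a batch of examples to per-feature (min, max) statistics (hypothetical helper)."""
    summary: Summary = {}
    for row in batch:
        for name, value in row.items():
            lo, hi = summary.get(name, (value, value))
            summary[name] = (min(lo, value), max(hi, value))
    return summary


def find_anomalies(saved: Summary, new_batch: Batch) -> List[str]:
    """Compare a new batch with the saved summary; the old rows are not needed."""
    anomalies: List[str] = []
    for i, row in enumerate(new_batch):
        for name, value in row.items():
            if name not in saved:
                anomalies.append(f"row {i}: unexpected feature '{name}'")
                continue
            lo, hi = saved[name]
            if not lo <= value <= hi:
                anomalies.append(f"row {i}: '{name}'={value} outside [{lo}, {hi}]")
    return anomalies


# Only this small summary would be kept as metadata; the training rows
# themselves stay wherever they were stored.
previous_summary = summarize(
    [{"age": 21.0, "income": 30.0}, {"age": 65.0, "income": 90.0}]
)
report = find_anomalies(
    previous_summary,
    [{"age": 40.0, "income": 120.0}, {"height": 1.8}],
)
```

Here `report` flags the out-of-range `income` value and the unknown `height` feature using nothing but the saved summary.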

michael dsouza answered Sep 29 '22 12:09


Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it is the index and the pointers to that data that are called the filesystem. It is this metadata that brings value to users, who can find the relevant data when they need it by searching or navigating through the filesystem.

Similarly, MLMD stores the metadata of an ML pipeline: which hyperparameters you used in an execution, which version of the training data, what the distribution of the features looked like, and so on. But it is more than just a registry of runs. This metadata can be used to power two killer features of an ML pipeline tool:

  1. asynchronous execution of its components, for example retraining a model when new data arrives without necessarily generating a new vocabulary
  2. reuse of results from previous runs, i.e. step-level output caching. For example, a step is not run again if its input parameters haven't changed; the output of a previous run is taken from the cache and fed to the next component.
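
Step-level caching can be sketched roughly as follows. This is a toy model in plain Python, not how TFX or MLMD implement it: the `StepCache` class, the URIs, and the `num_steps` parameter are hypothetical, and it assumes input URIs are versioned so that an unchanged URI means unchanged data.

```python
import hashlib
import json
from typing import Callable, Dict, List, Tuple


class StepCache:
    """Toy execution record: maps a fingerprint of a step's inputs to its output URI."""

    def __init__(self) -> None:
        self._outputs: Dict[str, str] = {}

    @staticmethod
    def fingerprint(step: str, input_uris: List[str], params: dict) -> str:
        # Hash the step name, the input pointers, and the parameters only;
        # the data behind the URIs is never read here.
        payload = json.dumps(
            {"step": step, "inputs": sorted(input_uris), "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run(
        self,
        step: str,
        input_uris: List[str],
        params: dict,
        fn: Callable[[], str],
    ) -> Tuple[str, bool]:
        """Return (output_uri, was_cached); call fn only on a cache miss."""
        key = self.fingerprint(step, input_uris, params)
        if key in self._outputs:
            return self._outputs[key], True
        output_uri = fn()
        self._outputs[key] = output_uri
        return output_uri, False


calls: List[str] = []


def expensive_train() -> str:
    # Stand-in for a costly training step that writes a model and returns its URI.
    calls.append("train")
    return f"s3://example-models/{len(calls)}"


cache = StepCache()
first = cache.run("Trainer", ["s3://example-data/v1"], {"num_steps": 100}, expensive_train)
second = cache.run("Trainer", ["s3://example-data/v1"], {"num_steps": 100}, expensive_train)
third = cache.run("Trainer", ["s3://example-data/v1"], {"num_steps": 200}, expensive_train)
```

The second call has the same inputs and parameters as the first, so the training function is skipped and the earlier output URI is returned; the third call changes a parameter and therefore runs again.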

So yes, the actual data is stored in a storage system, maybe a cloud bucket, in the form of Parquet files across transformations, or model files and schema protobufs. MLMD stores the URI of this data together with some meta information. For example, a SavedModel is stored in s3://mymodels/1, and it has an entry in the Artifact table of MLMD, with a relation to the Trainer run and its TrainArgs parameters in the ContextProperty table.
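
As a rough picture of "MLMD stores the URI, not the data", here is a sketch using an in-memory SQLite database. The three tables are a deliberately simplified stand-in and do not match MLMD's real schema; `record_run`, `produced_by`, the input URI, and the `num_steps` parameter are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE artifact (id INTEGER PRIMARY KEY, type TEXT, uri TEXT);
    CREATE TABLE execution (id INTEGER PRIMARY KEY, component TEXT, params TEXT);
    CREATE TABLE event (execution_id INTEGER, artifact_id INTEGER, direction TEXT);
    """
)


def record_run(component, params, inputs, outputs):
    """Record one component run; inputs/outputs are lists of (type, uri) pairs."""
    execution_id = conn.execute(
        "INSERT INTO execution (component, params) VALUES (?, ?)",
        (component, params),
    ).lastrowid
    for direction, artifacts in (("INPUT", inputs), ("OUTPUT", outputs)):
        for art_type, uri in artifacts:
            row = conn.execute(
                "SELECT id FROM artifact WHERE uri = ?", (uri,)
            ).fetchone()
            if row:
                artifact_id = row[0]
            else:
                # Only the pointer (URI) and its type are stored, never the bytes.
                artifact_id = conn.execute(
                    "INSERT INTO artifact (type, uri) VALUES (?, ?)",
                    (art_type, uri),
                ).lastrowid
            conn.execute(
                "INSERT INTO event VALUES (?, ?, ?)",
                (execution_id, artifact_id, direction),
            )
    return execution_id


def produced_by(uri):
    """Lineage query: which component run produced the artifact at this URI?"""
    return conn.execute(
        """
        SELECT e.component, e.params FROM execution e
        JOIN event ev ON ev.execution_id = e.id
        JOIN artifact a ON a.id = ev.artifact_id
        WHERE a.uri = ? AND ev.direction = 'OUTPUT'
        """,
        (uri,),
    ).fetchone()


record_run(
    "Trainer",
    '{"num_steps": 100}',
    inputs=[("Examples", "s3://example-data/train")],
    outputs=[("Model", "s3://mymodels/1")],
)
lineage = produced_by("s3://mymodels/1")
```

Looking up the model URI returns the component and parameters of the run that produced it, which is the kind of lineage question a metadata store is meant to answer without touching the model file.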

If not, what does it mean by results of pipeline components?

It means the pointers to the data generated by the run of a component, together with the input parameters. In the previous example, if neither the input data nor the TrainArgs of a Trainer component have changed in a run, that expensive component shouldn't run again; the model file should be reused from the cache.

This requirement of a continuous ML pipeline makes workflow managers such as Tekton or Argo more relevant than Airflow, and makes MLMD a more focused metadata store than the latter's metastore.

Theofilos Papapanagiotou answered Sep 29 '22 12:09