As far as I understand, TensorFlow uses MLMD to record and retrieve metadata associated with workflows, such as the results of pipeline components.
Features:
Does the above ("results of pipeline components") imply that MLMD stores actual data, e.g. input features for ML training? If not, what does it mean by the results of pipeline components?
Orchestration and pipeline history:
Also, when using TFX with e.g. Airflow, which has its own metastore (holding metadata about DAGs, their runs, and other Airflow configuration such as users, roles, and connections), does MLMD store redundant information? Does it supersede the Airflow metastore?
TFX is an ML pipeline/workflow framework, so when you write a TFX application what you are essentially doing is constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformation, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, for the step that checks for anomalies, TFX needs to remember the previous data schema/statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the metadata of the actual runs.

In terms of Airflow, it will also save the run metadata. This can be seen as a subset of all the metadata, and it is very limited compared to the metadata saved in MLMD, so there will be some redundancy involved. The controller is TFX, which defines and makes use of the underlying Airflow orchestration. MLMD will not supersede the Airflow metastore, but the pipeline will definitely fail if the two clash.
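To make "metadata, not data" concrete, here is a minimal sketch using MLMD's Python client (the ml-metadata package) that records a pointer to a schema artifact. The URI, type name, and property used here are illustrative assumptions; the point is that only the location and descriptive properties go into the store, never the data itself.

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Back the store with a local SQLite file (MySQL is also supported).
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = "/tmp/mlmd.sqlite"
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(config)

# Register an artifact type describing schema artifacts.
schema_type = metadata_store_pb2.ArtifactType()
schema_type.name = "Schema"
schema_type.properties["version"] = metadata_store_pb2.INT
schema_type_id = store.put_artifact_type(schema_type)

# Record only a pointer to the schema file plus descriptive properties;
# the schema contents themselves live on disk or in a bucket.
schema = metadata_store_pb2.Artifact()
schema.type_id = schema_type_id
schema.uri = "/pipeline_root/SchemaGen/schema/1"  # hypothetical path
schema.properties["version"].int_value = 1
[schema_id] = store.put_artifacts([schema])
```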
Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it's the index and the pointers to these data that is called the filesystem. It is that metadata that brings value to the user, who can find the relevant data when they need it by searching or navigating through the filesystem.
Similarly with MLMD: it stores the metadata of an ML pipeline, like which hyperparameters you've used in an execution, which version of training data, what the distribution of the features was, etc. But it's more than just a registry of the runs. This metadata can be used to empower two killer features of an ML pipeline tool: lineage tracking and caching (both illustrated below).
So yes, the actual data are indeed stored in some storage, maybe a cloud bucket, in the form of Parquet files across transformations, or model files and schema protobufs. MLMD stores the URIs to these data together with some meta information. For example, a SavedModel is stored at s3://mymodels/1, and it has an entry in the `Artifacts` table of MLMD, with a relation to the `Trainer` run and its `TrainArgs` parameters in the `ContextProperty` table.
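Continuing the sketch above, this is roughly how such a relation could be recorded through MLMD's execution and event APIs. The type and property names here are assumptions for illustration, not TFX's actual internal schema.

```python
from ml_metadata.proto import metadata_store_pb2

# Register hypothetical types for a Trainer run and its model output.
trainer_type = metadata_store_pb2.ExecutionType()
trainer_type.name = "Trainer"
trainer_type.properties["train_args"] = metadata_store_pb2.STRING
trainer_type_id = store.put_execution_type(trainer_type)

model_type = metadata_store_pb2.ArtifactType()
model_type.name = "Model"
model_type_id = store.put_artifact_type(model_type)

# Record one Trainer run with its parameters...
run = metadata_store_pb2.Execution()
run.type_id = trainer_type_id
run.properties["train_args"].string_value = '{"num_steps": 1000}'
[run_id] = store.put_executions([run])

# ...and the model it produced, stored elsewhere and referenced by URI.
model = metadata_store_pb2.Artifact()
model.type_id = model_type_id
model.uri = "s3://mymodels/1"
[model_id] = store.put_artifacts([model])

# An OUTPUT event links the run to the artifact it generated.
event = metadata_store_pb2.Event()
event.artifact_id = model_id
event.execution_id = run_id
event.type = metadata_store_pb2.Event.OUTPUT
store.put_events([event])
```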
"If not, what does it mean by results of pipeline components?"
It means the pointers to the data that have been generated by the run of a component, including the input parameters. In our previous example, if neither the input data nor the `TrainArgs` of a `Trainer` component have changed between runs, that expensive component shouldn't run again; the model file should be reused from the cache.
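Building on the snippets above, here is a minimal sketch of what such a cache lookup could look like. Real orchestrators do this more robustly (e.g. hashing all inputs and checking artifact liveness), so treat the helper and its logic as a hypothetical illustration.

```python
from ml_metadata.proto import metadata_store_pb2

def cached_model_uri(store, train_args_json):
    """Return the URI of a model from a past Trainer run with identical
    TrainArgs, or None if the expensive component actually has to run."""
    for execution in store.get_executions_by_type("Trainer"):
        prop = execution.properties.get("train_args")
        if prop is None or prop.string_value != train_args_json:
            continue
        # Follow OUTPUT events from the matching run to its artifacts.
        for event in store.get_events_by_execution_ids([execution.id]):
            if event.type == metadata_store_pb2.Event.OUTPUT:
                [artifact] = store.get_artifacts_by_id([event.artifact_id])
                return artifact.uri
    return None  # cache miss

uri = cached_model_uri(store, '{"num_steps": 1000}')
print(uri or "no cache hit: run the Trainer")
```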
This requirement of a continuous ML pipeline makes workflow managers such as Tekton or Argo more relevant compared to Airflow, and MLMD a more focused metadata store compared to the latter.