
TensorFlow Extended (TFX): Clarify Beam, Airflow and Kubeflow usage

I'm hoping someone can clarify the relationship between TensorFlow Extended (TFX) and its dependencies (Beam, Airflow, Flink, etc.).

I'm referencing the main TFX guide: https://www.tensorflow.org/tfx/guide#creating_a_tfx_pipeline_with_airflow

In the examples at https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline I see three variants: taxi_pipeline_flink.py, taxi_pipeline_kubeflow.py, and taxi_pipeline_simple.py.

BEAM Example?

There is no Beam example and little documentation describing its use.

Is it correct to assume that taxi_pipeline_simple.py would run even if Airflow wasn't installed? I think not, since it uses "AirflowDAGRunner". If not, can you run TFX with only Beam and its runner? If so, why is there no example of that?

Flink Example

In taxi_pipeline_flink.py, AirflowDAGRunner is used. I assume that means Airflow is the orchestrator, which in turn uses Flink as its executor. Correct?

Airflow Example

The page states that Beam is a required dependency, yet Airflow doesn't list Beam as one of its executors. It only has SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, and KubernetesExecutor. So is Beam only needed when not using Airflow? If Beam is required even with Airflow, what is its purpose there?

Thank you for any insights.

asked May 17 '19 at 20:05 by Robert Lugg




1 Answer

A) To run a TFX pipeline, you need an orchestrator. Examples are Apache Airflow, Kubeflow Pipelines, and Apache Beam.

B) Apache Beam is also (and perhaps mainly) used for distributed data processing inside some TFX components. Apache Beam is therefore required with whichever orchestrator you choose, even if Beam itself is not the orchestrator.
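To make the split between points A and B concrete, here is a toy sketch in plain Python. It does not use the TFX, Airflow, or Beam APIs; every name in it (`engine_map`, `ToyComponent`, the two orchestrator functions) is hypothetical and exists only for illustration. The idea it tries to show: the orchestrator decides when each component runs, while the data-processing layer (the role Beam plays) is called from inside the components, so swapping orchestrators does not remove the need for it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

ENGINE_CALLS: List[str] = []  # records which components used the engine


def engine_map(component: str, fn: Callable[[int], int], rows: List[int]) -> List[int]:
    """Hypothetical stand-in for the data-processing layer (Beam's role)."""
    ENGINE_CALLS.append(component)
    return [fn(r) for r in rows]


@dataclass
class ToyComponent:
    name: str
    upstream: List[str]
    run: Callable[[Dict[str, List[int]]], List[int]]


# Two toy components; both do their work through the engine.
COMPONENTS = [
    ToyComponent("example_gen", [],
                 lambda inputs: engine_map("example_gen", lambda x: x, [1, 2, 3])),
    ToyComponent("transform", ["example_gen"],
                 lambda inputs: engine_map("transform", lambda x: x * 10,
                                           inputs["example_gen"])),
]


def sequential_orchestrator(components: List[ToyComponent]) -> Dict[str, List[int]]:
    """Toy orchestrator 1: runs components in the listed (already sorted) order."""
    outputs: Dict[str, List[int]] = {}
    for c in components:
        outputs[c.name] = c.run({u: outputs[u] for u in c.upstream})
    return outputs


def on_demand_orchestrator(components: List[ToyComponent]) -> Dict[str, List[int]]:
    """Toy orchestrator 2: resolves each component's upstream recursively."""
    by_name = {c.name: c for c in components}
    outputs: Dict[str, List[int]] = {}

    def resolve(name: str) -> List[int]:
        if name not in outputs:
            c = by_name[name]
            outputs[name] = c.run({u: resolve(u) for u in c.upstream})
        return outputs[name]

    for c in components:
        resolve(c.name)
    return outputs
```

Running either orchestrator over `COMPONENTS` gives the same outputs and the same entries in `ENGINE_CALLS`: the scheduling strategy changed, but the components still went through the engine.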

Answering your points:

1) Beam example: There is now a Beam example at https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_beam.py. As you correctly expected, it does not use AirflowDAGRunner, since this example does not use Airflow as the orchestrator.

2) Airflow example: Beam is a required dependency for the reason stated above. TFX always uses Beam for distributed data processing in some components, so you need Beam even with Airflow (or any other tool) as the orchestrator.

3) Flink example: At the moment I cannot find this example anywhere (probably because the link has changed since you posted), but it is possible that Flink is used as the Beam runner while Airflow is the orchestrator. I couldn't find any mention of Flink in Airflow's documentation.
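As a hedged illustration of point 3: in Apache Beam, Flink is selected as a runner through Beam pipeline options rather than through anything in Airflow. The sketch below only builds those option strings. The helper name `flink_runner_args` and the address `localhost:8081` are assumptions for illustration, and how a given TFX version forwards such options to Beam should be checked against that version's documentation.

```python
from typing import List


def flink_runner_args(flink_master: str) -> List[str]:
    """Build Beam pipeline options that select the Flink runner.

    `flink_master` is the address of a Flink cluster (placeholder value
    in the usage line below; replace it with your own).
    """
    return ["--runner=FlinkRunner", f"--flink_master={flink_master}"]


# Hypothetical usage: these strings would be handed to Beam as pipeline options.
print(flink_runner_args("localhost:8081"))
```

Under this reading, the orchestrator (Airflow) and the Beam runner (Flink) are configured independently, which would explain why Flink does not appear among Airflow's executors.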

Hope it helps to some extent.

answered Nov 03 '22 at 00:11 by Elisio Quintino