I'm hoping someone can clarify the relationship between TFX and its dependencies (Beam, Airflow, Flink, etc.).
I'm referencing the main TFX guide: https://www.tensorflow.org/tfx/guide#creating_a_tfx_pipeline_with_airflow
In the examples at https://github.com/tensorflow/tfx/tree/master/tfx/examples/chicago_taxi_pipeline, I see three variants:
- taxi_pipeline_flink.py
- taxi_pipeline_kubeflow.py
- taxi_pipeline_simple.py
There is no Beam example, and little documentation describes Beam's use.
Is it correct to assume that taxi_pipeline_simple.py would run even if Airflow wasn't installed? I think not, since it uses AirflowDAGRunner. If not, can you run TFX with only Beam and its runner? If so, why is there no example of that?
In taxi_pipeline_flink.py, AirflowDAGRunner is also used. I assume that means Airflow is the orchestrator, which in turn uses Flink as its executor. Correct?
The page states that Beam is a required dependency, yet Airflow doesn't list Beam as one of its executors; it only has SequentialExecutor, LocalExecutor, CeleryExecutor, DaskExecutor, and KubernetesExecutor. Is Beam therefore only needed when not using Airflow? And when using Airflow, what purpose does Beam serve, if it is required?
Thank you for any insights.
TFX is a Google-production-scale machine learning toolkit based on TensorFlow. It provides a configuration framework and shared libraries to integrate common components needed to define, launch, and monitor your machine learning system.
TFX is used by thousands of users within Alphabet, and it powers hundreds of popular Alphabet products, including Cloud AI services on Google Cloud Platform (GCP).
With some consideration of the needs of a particular modeling framework, a TFX pipeline can be used to train models in any other Python-based ML framework. This includes Scikit-learn, XGBoost, and PyTorch, among others.
TFX provides a configuration framework to express ML pipelines consisting of TFX components, and those pipelines can be orchestrated using Apache Airflow or Kubeflow Pipelines.
A) In order to run TFX pipelines, you need an orchestrator. Examples are Apache Airflow, Kubeflow Pipelines, and Apache Beam itself.
B) Apache Beam is ALSO (and perhaps mainly) used for distributed data processing inside some TFX components. Therefore, Apache Beam is necessary with whichever orchestrator you choose, even if you don't use Apache Beam as the orchestrator!
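To make the orchestrator's role concrete, here is a toy, stdlib-only sketch (this is NOT the real TFX API; the component names and dependency graph are illustrative): an orchestrator's core job is to run pipeline components in dependency order, while each component may internally launch a Beam job for the heavy data work.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical component dependency graph: each component lists its inputs.
# In real TFX, e.g., StatisticsGen consumes the output of ExampleGen.
pipeline = {
    "ExampleGen": [],
    "StatisticsGen": ["ExampleGen"],
    "SchemaGen": ["StatisticsGen"],
    "Trainer": ["ExampleGen", "SchemaGen"],
}

def run_pipeline(graph):
    """A toy 'orchestrator': execute components in topological order."""
    order = list(TopologicalSorter(graph).static_order())
    for component in order:
        # A real component would do its work here -- often by submitting
        # an Apache Beam job (e.g., StatisticsGen computing statistics).
        print(f"running {component}")
    return order

run_pipeline(pipeline)
```

Airflow, Kubeflow Pipelines, and Beam (as an orchestrator) all play this scheduling role; the distributed data crunching inside each step is a separate concern, which is where Beam is always needed.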
Answering your points:
1) Beam example - There is now a Beam example at https://github.com/tensorflow/tfx/blob/master/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_beam.py. As you correctly suspected, there is no AirflowDAGRunner there, since that example does not use Airflow as the orchestrator.
2) Airflow example - Beam is a required dependency for the reason stated above: Beam is always used by TFX for distributed data processing in some components. Therefore, even with Airflow (or any other orchestrator), you need Beam.
3) Flink example - at the moment, I cannot find this example anywhere (probably due to changes to the link since you posted), but it is possible that Flink would be used as a runner while Airflow is the orchestrator. However, I couldn't find any mention of Flink in Airflow's documentation.
Hope it helps to some extent.