Kubeflow vs other options [closed]

Tags:

kubeflow

I am trying to find when it makes sense to create your own Kubeflow MLOps platform:

  • If you are a TensorFlow-only shop, do you still need Kubeflow? Why not TFX alone? Orchestration can be done with Airflow.
  • Why use Kubeflow if all you are using is scikit-learn, which does not support GPU or distributed training anyway? Orchestration can be done with Airflow.
  • If you are convinced to use Kubeflow: cloud providers (Azure and GCP) offer the ML pipeline concept as managed services (Google uses Kubeflow under the hood). When does it make sense to deploy your own Kubeflow environment? Even if you have a requirement to deploy on-prem, you can still train your models with cloud resources (nodes and data in the cloud) and only deploy the model on-prem. So does using Azure or GCP AI Platform as a managed service make the most sense for delivering ML pipelines?
Asked Mar 21 '20 by Cengiz

People also ask

Which is better MLflow or Kubeflow?

Use Kubeflow if you want to track your machine learning experiments and deploy your solutions in a more customized way, backed by Kubernetes. Use MLflow if you want a simpler approach to experiment tracking and want to deploy to managed platforms such as Amazon SageMaker.

What is the difference between airflow and Kubeflow?

Airflow is purely a pipeline orchestration platform, but Kubeflow can do much more than orchestration. In fact, Kubeflow focuses mainly on machine learning tasks, like experiment tracking. In Kubeflow, an experiment is a workspace that lets you try out different configurations of your pipelines.

What is the difference between Kubeflow and Kubernetes?

Kubernetes takes care of resource management, job allocation, and other operational problems that have traditionally been time-consuming. Kubeflow allows engineers to focus on writing ML algorithms instead of managing their operations.

Is Kubeflow popular?

Kubeflow has become quite popular in the MLOps community as the tool that enables data science teams to automate their workflows from data preprocessing to model deployment on Kubernetes.


1 Answer

Building an MLOps platform is an action companies take in order to accelerate and manage the workflow of their data scientists in production. This workflow is reflected in ML pipelines and includes three main tasks: feature engineering, training, and serving.

Feature engineering and model training are tasks that require a pipeline orchestrator, since downstream tasks depend on upstream ones, and those dependencies make the whole pipeline prone to errors.

Software building pipelines are different from data pipelines, which are in turn different from ML pipelines.

A software CI/CD flow compiles code into deployable artifacts and accelerates the software delivery process. So: code in, artifact out. This is achieved by invoking compilation tasks, executing tests, and deploying the artifact. Dominant orchestrators for such pipelines are Jenkins, GitLab CI, etc.

A data processing flow takes raw data and performs transformations to create features, aggregations, counts, etc. So: data in, data out. This is achieved by invoking remote, distributed tasks that perform data transformations and store intermediate artifacts in data repositories. Tools for such pipelines are Airflow, Luigi, and various Hadoop-ecosystem solutions.
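As a rough illustration, a data pipeline in Airflow is just a Python DAG of tasks. The sketch below assumes the Airflow 2.x import paths; the DAG id, task names, and function bodies are hypothetical placeholders.

    # Minimal Airflow DAG sketch: data in, data out.
    # Task names and the transform bodies are hypothetical placeholders.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract_raw():
        pass  # pull raw records from a source system


    def build_features():
        pass  # transform/aggregate raw records into features


    with DAG(
        dag_id="feature_pipeline",
        start_date=datetime(2020, 3, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = PythonOperator(task_id="extract_raw", python_callable=extract_raw)
        features = PythonOperator(task_id="build_features", python_callable=build_features)

        extract >> features  # build_features runs only after extract_raw succeeds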

In the machine learning flow, the ML engineer writes code to train models, uses data to evaluate them, and then observes how they perform in production in order to improve them. So: code and data in, model out. Implementing such a workflow therefore requires a combination of the orchestration technologies discussed above.

TFX presents this pipeline and proposes components that perform these subsequent tasks. It defines a modern, complete ML pipeline, from building the features, to running the training, evaluating the results, and deploying and serving the model in production.
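To make that concrete, here is a rough sketch of a TFX pipeline definition. Exact component arguments vary between TFX releases, and the paths and trainer module file are hypothetical placeholders.

    # Sketch of a TFX pipeline: ExampleGen -> Trainer -> Pusher.
    # Component signatures differ across TFX releases; paths and module file are hypothetical.
    from tfx.components import CsvExampleGen, Trainer, Pusher
    from tfx.orchestration import pipeline
    from tfx.proto import pusher_pb2, trainer_pb2

    example_gen = CsvExampleGen(input_base="/data/raw")  # build examples/features from CSVs
    trainer = Trainer(
        module_file="trainer_module.py",                  # user code defining the model
        examples=example_gen.outputs["examples"],
        train_args=trainer_pb2.TrainArgs(num_steps=1000),
        eval_args=trainer_pb2.EvalArgs(num_steps=100),
    )
    pusher = Pusher(
        model=trainer.outputs["model"],
        push_destination=pusher_pb2.PushDestination(
            filesystem=pusher_pb2.PushDestination.Filesystem(
                base_directory="/serving/model"           # where the trained model is exported
            )
        ),
    )

    ml_pipeline = pipeline.Pipeline(
        pipeline_name="tfx_sketch",
        pipeline_root="/pipelines/tfx_sketch",
        components=[example_gen, trainer, pusher],
    )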

Kubernetes is the most advanced system for orchestrating containers, the de facto tool for running workloads in production, and the cloud-agnostic solution that saves you from cloud-vendor lock-in and so helps optimize your costs.

Kubeflow is positioned as the way to do ML on Kubernetes, by implementing TFX. It ultimately handles the code-and-data-in, model-out workflow. It provides a coding environment by implementing Jupyter notebooks in the form of Kubernetes resources, called Notebooks. All cloud providers are on board with the project and implement their data-loading mechanisms across KF's components. Orchestration is implemented via KF Pipelines and model serving via KF Serving. The metadata across its components is specified in the specs of the Kubernetes resources throughout the platform.
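To give an idea of what this looks like in practice, below is a minimal KF Pipelines sketch using the v1-style kfp SDK; the pipeline name, container image, and command are hypothetical placeholders.

    # Minimal Kubeflow Pipelines sketch (kfp v1-style SDK): one containerized training step.
    # The image and command are hypothetical placeholders.
    import kfp
    from kfp import dsl


    def train_op():
        return dsl.ContainerOp(
            name="train",
            image="gcr.io/my-project/trainer:latest",
            command=["python", "train.py"],
        )


    @dsl.pipeline(name="demo-pipeline", description="code and data in, model out")
    def demo_pipeline():
        train_op()


    if __name__ == "__main__":
        # Compiles the pipeline into an Argo Workflow spec that KF Pipelines can run.
        kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")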

In Kubeflow, the TFX components exist in the form of reusable tasks implemented as containers. The lifecycle of these components is managed by Argo, the orchestrator of KF Pipelines. Argo implements these workflows as Kubernetes CRDs. In a workflow spec we define the DAG tasks, the TFX components as containers, the metadata that will be written to the metadata store, etc. The execution of these workflows happens natively using standard Kubernetes resources such as pods, as well as custom resource definitions such as experiments. That makes the implementation of the pipeline and its components language-agnostic, unlike Airflow, which implements tasks in Python only. These tasks and their lifecycle are then managed natively by Kubernetes, without the need for duct-tape solutions like Airflow's Kubernetes operator. Since everything is implemented as Kubernetes resources, everything is YAML, which is about the most Git-friendly configuration you can find. Good luck trying to enforce version control on Airflow's DAG directory.
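For example, the Workflow custom resource that KF Pipelines compiles to can also be created directly with the official kubernetes Python client, as sketched below; the namespace, image, and command are hypothetical placeholders.

    # Sketch: submitting an Argo Workflow (the CRD behind KF Pipelines) via the
    # kubernetes Python client. Namespace, image, and command are hypothetical.
    from kubernetes import client, config

    workflow = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": "train-"},
        "spec": {
            "entrypoint": "train",
            "templates": [
                {
                    "name": "train",
                    "container": {
                        "image": "gcr.io/my-project/trainer:latest",
                        "command": ["python", "train.py"],
                    },
                }
            ],
        },
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="argoproj.io",
        version="v1alpha1",
        namespace="kubeflow",
        plural="workflows",
        body=workflow,
    )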

The deployment and management of the model in production is done via KF Serving, using the InferenceService CRD. It uses Istio's virtual services for secure access to the models, Knative Serving's scale-from-zero pods for serverless resources, revisions for versioning, Prometheus metrics for observability, logs in ELK for debugging, and more. Running models in production could not be more SRE-friendly than that.
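As a sketch, an InferenceService can be declared and created the same way as any other custom resource; note that the exact apiVersion and spec layout depend on the KFServing/KServe release, and the model URI and namespace below are hypothetical.

    # Sketch: creating a KFServing InferenceService with the kubernetes Python client.
    # apiVersion/spec layout varies by KFServing/KServe release; URI and namespace are hypothetical.
    from kubernetes import client, config

    inference_service = {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "sklearn-demo"},
        "spec": {
            "predictor": {
                "sklearn": {"storageUri": "gs://my-bucket/models/sklearn-demo"}
            }
        },
    }

    config.load_kube_config()
    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kubeflow.org",
        version="v1beta1",
        namespace="models",
        plural="inferenceservices",
        body=inference_service,
    )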

On the topic of splitting training/serving between cloud and on-premises, the use of Kubernetes is even more important, as it abstracts away each provider's custom infrastructure and so provides a unified environment for the developer/ML engineer.

Answered Oct 24 '22 by Theofilos Papapanagiotou