 

How to best run Apache Airflow tasks on a Kubernetes cluster?

What we want to achieve:

We would like to use Airflow to manage our machine learning and data pipelines, while using Kubernetes to manage resources and schedule the jobs. That is, Airflow would orchestrate the workflow (e.g. task dependencies, re-running jobs upon failure) and Kubernetes would orchestrate the infrastructure (e.g. cluster autoscaling and assigning individual jobs to nodes). In other words, Airflow tells the Kubernetes cluster what to do, and Kubernetes decides how to distribute the work. At the same time, we would also want Airflow to be able to monitor the status of individual tasks. For example, if we have 10 tasks spread across a cluster of 5 nodes, Airflow should be able to communicate with the cluster and report something like: 3 “small tasks” are done, 1 “small task” has failed and will be scheduled to re-run, and the remaining 6 “big tasks” are still running.

Questions:

Our understanding is that Airflow has no Kubernetes operator; see the open issue at https://issues.apache.org/jira/browse/AIRFLOW-1314. That said, we don’t want Airflow to manage resources (service accounts, environment variables, creating clusters, etc.) but simply to send tasks to an existing Kubernetes cluster and let Airflow know when a job is done. An alternative would be Apache Mesos, but it looks less flexible and less straightforward compared to Kubernetes.

I guess we could use Airflow’s bash_operator to run kubectl, but that doesn’t seem like the most elegant solution.

Any thoughts? How do you deal with that?

asked Jun 19 '18 by Ricky Lui

People also ask

Can we use Airflow on Kubernetes?

Apache Airflow aims to be a very Kubernetes-friendly project, and many users run Airflow from within a Kubernetes cluster in order to take advantage of the increased stability and autoscaling options that Kubernetes provides.

How can I improve my Kubernetes performance?

To enhance Kubernetes performance, focus on defining resource limits, using optimized and lightweight container images, and deploying clusters closer to your users.

What is the smallest deployable resource in Kubernetes?

Pods are the smallest deployable units of computing that you can create and manage in Kubernetes.


1 Answer

Airflow has both a Kubernetes Executor as well as a Kubernetes Operator.

You can use the Kubernetes Operator to send tasks (in the form of Docker images) from Airflow to Kubernetes via whichever Airflow executor you prefer.

Based on your description, though, I believe you are looking for the KubernetesExecutor to schedule all your tasks against your Kubernetes cluster. As you can see from the source code, it has a much tighter integration with Kubernetes.
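Switching to it is mostly configuration: with the KubernetesExecutor, the scheduler launches one worker pod per task. A sketch of the relevant `airflow.cfg` fragment — the repository, tag, and namespace values are placeholders you would set for your own cluster:

```ini
[core]
executor = KubernetesExecutor

[kubernetes]
# Image the scheduler launches for each task (placeholder values).
worker_container_repository = my-registry/airflow-worker
worker_container_tag = latest
namespace = airflow
in_cluster = True
```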

This also means you don't have to build the Docker images ahead of time, as is required with the Kubernetes Operator.

answered Oct 13 '22 by andscoop