I have multiple DAGs using the CeleryExecutor, but I want one particular DAG to run using the KubernetesExecutor, and I have been unable to work out a good, reliable way to achieve this.
My airflow.cfg declares CeleryExecutor as the executor to use, and I don't want to change that, since every DAG but this one really needs it.
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = CeleryExecutor
My DAG code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime.utcnow(),
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}
dag = DAG('kubernetes_sample_1', default_args=default_args)
start = DummyOperator(task_id='run_this_first', dag=dag)
passing = KubernetesPodOperator(
    namespace='default',
    image="python:3.6",           # image and binary names are lowercase
    cmds=["python", "-c"],
    arguments=["print('hello world')"],
    labels={"foo": "bar"},
    name="passing-test",
    task_id="passing-task",
    get_logs=True,
    dag=dag,
)
failing = KubernetesPodOperator(
    namespace='default',
    image="ubuntu:1604",          # intentionally misconfigured (bad tag, no Python) so this pod fails
    cmds=["Python", "-c"],
    arguments=["print('hello world')"],
    labels={"foo": "bar"},
    name="fail",
    task_id="failing-task",
    get_logs=True,
    dag=dag,
)
passing.set_upstream(start)
failing.set_upstream(start)
I could put in an if-else condition and change the value at the point where Airflow picks up the configuration. If that sounds right, please tell me which paths and files are involved. I was hoping for a more mature method, though, if one exists.
The CeleryKubernetesExecutor allows users to run a CeleryExecutor and a KubernetesExecutor simultaneously; the executor used for a given task is chosen based on the task's queue.
CeleryExecutor is one of the ways you can scale out the number of workers. For this to work, you need to set up a Celery backend (RabbitMQ, Redis, etc.) and change your airflow.cfg to point the executor parameter to CeleryExecutor, providing the related Celery settings.
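For reference, a minimal sketch of the related airflow.cfg entries, assuming Redis as the message broker and Postgres as the result backend (the connection strings here are placeholders, not your actual values):
[celery]
broker_url = redis://redis:6379/0
result_backend = db+postgresql://airflow:airflow@postgres/airflow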
Now there is also the CeleryKubernetesExecutor (introduced in Airflow 2.0), which requires setting up both Celery and Kubernetes, but offers the functionality of both.
In the official documentation, they offer a rule of thumb to decide when it's worth using it:
We recommend considering the CeleryKubernetesExecutor when your use case meets:
- The number of tasks needed to be scheduled at the peak exceeds the scale that your Kubernetes cluster can comfortably handle.
- A relatively small portion of your tasks requires runtime isolation.
- You have plenty of small tasks that can be executed on Celery workers, but you also have resource-hungry tasks that will be better off running in predefined environments.
Starting with Airflow 2.x, configure airflow.cfg as follows: in the [core] section set executor = CeleryKubernetesExecutor, and in the [celery_kubernetes_executor] section set kubernetes_queue = kubernetes.
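Put together, the relevant airflow.cfg entries are:
[core]
executor = CeleryKubernetesExecutor

[celery_kubernetes_executor]
kubernetes_queue = kubernetes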
Then, whenever you want a task instance to run on the Kubernetes executor, add the parameter queue = 'kubernetes' to the task definition. For example:
from airflow.operators.bash import BashOperator

# task1 is routed to the Kubernetes executor via its queue
task1 = BashOperator(
    task_id='Test_kubernetes_executor',
    bash_command='echo Kubernetes',
    queue='kubernetes',
)

# task2 has no queue set, so it stays on the Celery executor
task2 = BashOperator(
    task_id='Test_Celery_Executor',
    bash_command='echo Celery',
)
On running the DAG you will see task1 running on Kubernetes and task2 on Celery. So unless you set a task's queue to kubernetes, all tasks will run on the Celery executor.
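Coming back to the original question (one whole DAG on Kubernetes, everything else on Celery): since queue is a regular BaseOperator argument, you can put it in default_args so every task of that single DAG inherits it, with no per-task changes. A minimal sketch, assuming the default kubernetes_queue name from above (the DAG and task names here are made up for illustration):

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

# Every task in this DAG inherits queue='kubernetes' from default_args,
# so the whole DAG runs on the Kubernetes half of CeleryKubernetesExecutor;
# all other DAGs, which don't set a queue, stay on Celery.
with DAG(
    'kubernetes_only_dag',
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    default_args={'queue': 'kubernetes'},
) as dag:
    BashOperator(
        task_id='runs_on_k8s',
        bash_command='echo "running in a Kubernetes pod"',
    )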