
How to use the DockerOperator from Apache Airflow

This question is about understanding a concept related to the DockerOperator in Apache Airflow, so I am not sure if this site is the correct place for it. If not, please let me know where I can post it.

The situation is the following: I am working on a Windows laptop, and I have developed a very basic ETL pipeline that extracts data from some server and writes the unprocessed data into a MongoDB on a scheduled basis with Apache Airflow. I have a docker-compose.yml file with four services: a mongo service for the MongoDB, a mongo-express service as an admin tool for the MongoDB, a webserver service for Apache Airflow, and a postgres service as the database backend for Apache Airflow.

So far, I have developed some Python code organized into functions, and these functions are called by the Airflow instance via the PythonOperator. Since debugging is very difficult with the PythonOperator, I now want to try the DockerOperator instead. I have been following this tutorial, which claims that with the DockerOperator you can develop your source code independently of the operating system it will later run on, thanks to Docker's 'build once, run everywhere' concept.

My problem is that I don't fully understand all the steps needed to run code using the DockerOperator. Regarding the tutorial's Task Development and Deployment section, I have the following questions:

  1. Package the artifacts together with all dependencies into a Docker image. ==> Does this mean that I have to create a Dockerfile for every task and then build an image using this Dockerfile?
  2. Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?

Thanks for your time, I highly appreciate it!

asked Jan 16 '20 by Kevin Südmersen




1 Answer

Typically you're going to have a Docker image that handles one type of task, so for any one pipeline you'd probably use a variety of Docker images, a different one for each step.

There are a couple of considerations here with regard to your question, which is specifically about deployment.

  1. You'll need to create a Docker image, and you'll likely want to tag it so you can version it; the DockerOperator defaults to the latest tag on an image (see the Dockerfile sketch after this list).
  2. The image needs to be available to your deployed instance of Airflow. It can be built on the machine you're running Airflow on if you want to run everything locally. If you've deployed Airflow somewhere online, the more common practice is to push the image to a container registry; there are a number of providers you can use (Docker Hub, Amazon ECR, etc.).
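To make point 1 concrete, here is a minimal, illustrative Dockerfile sketch. The script name extract_from_api_or_something.py is borrowed from the example further down; the base image, the requirements file, and everything else here are assumptions, not something Airflow prescribes:

# Dockerfile (illustrative sketch): package one task's code and dependencies
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY extract_from_api_or_something.py .
# Default command; the DockerOperator's command argument overrides this,
# which is how you invoke and parameterize the task (point 2 of the tutorial)
CMD ["python", "extract_from_api_or_something.py"]

You would then build and version the image with something like docker build -t dummyorg/dummy_api_tools:v1 . and, for a remotely deployed Airflow, push it to your registry with docker push dummyorg/dummy_api_tools:v1.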

Expose an Entrypoint from your container to invoke and parameterize a task using the DockerOperator. ==> How do you do this?

Once your image is built and available to Airflow, you simply need to create a task using the DockerOperator, like so:

from airflow import DAG
# In Airflow 1.x the import path is airflow.operators.docker_operator
from airflow.providers.docker.operators.docker import DockerOperator

dag = DAG(**kwargs)  # your usual DAG arguments (dag_id, start_date, schedule_interval, ...)

task_1 = DockerOperator(
    dag=dag,
    task_id='docker_task',
    image='dummyorg/dummy_api_tools:v1',       # the versioned image from step 1
    auto_remove=True,                          # remove the container once the task finishes
    docker_url='unix://var/run/docker.sock',   # Docker daemon on the machine running Airflow
    command='python extract_from_api_or_something.py'
)
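Because command overrides the image's default CMD, it is also how you parameterize a run. The command field of the DockerOperator is templated, so something like command='python extract_from_api_or_something.py {{ ds }}' would pass the execution date into the (hypothetical) script.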

I'd recommend investing some time into understanding Docker. It's a little bit difficult to wrap your head around at first, but it's a highly valuable tool, especially for systems like Airflow.

answered Oct 14 '22 by dlachasse