Cloud Composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.

All my DAGs run daily at 3:00 AM or 4:00 AM, but some mornings I find that a few tasks failed during the night without any apparent reason.

  • When I check the logs in the UI, there are none, and there is nothing in the log folder in the GCS bucket either.

  • In the task instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the only dependency listed is the dagrun itself.

  • Although the DAG is configured with 5 retries and an email on failure (roughly as in the default_args sketch below), it does not look as if any retry took place, and I never received an email about the failure.

  • I usually just clear the task instance and it then runs successfully on the first try.

Has anyone encountered a similar problem?
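
For reference, the retry and email settings look roughly like the following in the DAG file; the dag_id, schedule, and alert address are placeholders, not the real values:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Placeholder values; only the retries/email settings match the description above.
default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "retries": 5,                      # 5 retries configured
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],   # placeholder address
    "email_on_failure": True,          # should send a mail when a task fails
    "email_on_retry": False,
}

dag = DAG(
    dag_id="nightly_example",
    default_args=default_args,
    schedule_interval="0 3 * * *",     # runs daily at 3:00 AM
)

task = DummyOperator(task_id="do_work", dag=dag)
```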

Ary Jazz, asked Jan 21 '19

People also ask

How do you check Airflow DAG logs?

You can also view the logs in the Airflow web interface. Streaming logs: these logs are a superset of the logs in Airflow. To access streaming logs, go to the Logs tab of the Environment details page in the Google Cloud console, use Cloud Logging, or use Cloud Monitoring. Logging and Monitoring quotas apply.
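
As a hedged sketch, the same streaming logs can also be pulled with the Cloud Logging Python client; the filter below is only an example and may need to be adapted to what the Logs Explorer shows for your environment:

```python
from google.cloud import logging as cloud_logging  # pip install google-cloud-logging

client = cloud_logging.Client()  # uses the project from your credentials

# Example filter: worker logs from a Composer environment. The resource type
# and log name used here are assumptions; adjust them to your environment.
log_filter = (
    'resource.type="cloud_composer_environment" '
    'AND log_name:"airflow-worker"'
)

for entry in client.list_entries(filter_=log_filter, page_size=50):
    print(entry.timestamp, entry.payload)
```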

Which logs should I look at for Airflow cluster startup issues?

The service logs are available at /media/ephemeral0/logs/airflow inside the cluster node. Since Airflow runs on a single node, the logs are accessible on that same node. These logs are helpful for troubleshooting cluster bring-up and scheduling issues.

How do I debug DAG Airflow?

Run airflow dags list with the Airflow CLI to make sure that Airflow has registered the DAG in the metastore. If the DAG appears in the list, try restarting the webserver. Try restarting the scheduler (if you are using the Astro CLI, run astro dev stop && astro dev start).
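
Note that on Airflow 1.x (as in Composer 1.9.0) the equivalent CLI command is airflow list_dags. A version-independent, hedged check from Python is to load the DagBag and print any import errors:

```python
from airflow.models import DagBag

# Parse the DAGs folder the same way the scheduler does.
dag_bag = DagBag()

print("DAGs found:", list(dag_bag.dags.keys()))

# Any file that failed to import shows up here with its traceback;
# a DAG that never appears in the UI is often hiding in this dict.
for path, error in dag_bag.import_errors.items():
    print(path, "->", error)
```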


1 Answer

Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush the logs to GCS), which is usually caused by an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
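
If you would rather check this programmatically than in the console, here is a minimal sketch with the Kubernetes Python client; it assumes you have already fetched credentials for the Composer GKE cluster (e.g. with gcloud container clusters get-credentials) so a kubeconfig is available:

```python
from kubernetes import client, config  # pip install kubernetes

# Assumes a kubeconfig pointing at the Composer GKE cluster.
config.load_kube_config()

v1 = client.CoreV1Api()

# Composer's workers live in a "composer-..." namespace; listing all
# namespaces avoids having to guess the exact name.
pods = v1.list_pod_for_all_namespaces(watch=False)

for pod in pods.items:
    if "airflow-worker" in pod.metadata.name and pod.status.reason == "Evicted":
        print(pod.metadata.namespace, pod.metadata.name, pod.status.message)
```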

In "Task Instances" you will probably also see that the affected tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is evidence that the pod died.

Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a machine type with more memory.
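
The concurrency knob referred to here is the worker concurrency option in the [celery] section of airflow.cfg (set via the environment's Airflow configuration overrides in Composer). As a hedged, DAG-level alternative, you can also cap parallelism per DAG; the values below are illustrative, not recommendations:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Hypothetical DAG with per-DAG parallelism limits.
dag = DAG(
    dag_id="nightly_example_limited",
    start_date=datetime(2019, 1, 1),
    schedule_interval="0 3 * * *",
    concurrency=4,        # at most 4 task instances of this DAG run at once
    max_active_runs=1,    # never run two dagruns of this DAG in parallel
)

task = DummyOperator(task_id="do_work", dag=dag)
```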

EDIT: I filed this Feature Request on your behalf, so that failure emails are sent even when the pod was evicted.

Iñigo, answered Sep 20 '22