
Airflow: dag_id could not be found

Tags:

airflow

I'm running an Airflow server and worker on different AWS machines. I've synced the dags folder between them, run airflow initdb on both, and checked that the dag_ids are the same when I run airflow list_tasks <dag_id>.

When I run the scheduler and worker, I get this error on the worker:

airflow.exceptions.AirflowException: dag_id could not be found: . Either the dag did not exist or it failed to parse. [...] Command ...--local -sd /home/ubuntu/airflow/dags/airflow_tutorial.py'

The problem seems to be that the path there is wrong (/home/ubuntu/airflow/dags/airflow_tutorial.py), since the correct path on the worker is /home/hadoop/...

On the server machine the path does go through ubuntu, but in both config files it's given simply as ~/airflow/...

What makes the worker look in this path and not the correct one?

How do I tell it to look in its own home dir?

Edit:

  • It's unlikely to be a config problem. I've run grep -R ubuntu and the only occurrences are in the logs.
  • When I run the same setup on a machine where the user is ubuntu, everything works, which leads me to believe that for some reason Airflow provides the worker with the full path of the task.
Dotan asked Apr 05 '17


3 Answers

Adding the --raw parameter to the airflow run command helped me see the original exception. In my case, the metadata database instance was too slow and loading DAGs failed because of a timeout. I fixed it by:

  • Upgrading the database instance
  • Increasing the dagbag_import_timeout parameter in airflow.cfg (see the sketch below)
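
A minimal sketch of both steps, assuming the failing task is my_task in my_dag (placeholder names) and that you can edit airflow.cfg on the affected machine:

    # Re-run the failing task by hand with --raw to surface the underlying
    # exception instead of the generic "dag_id could not be found" message
    # (my_dag, my_task and the execution date are placeholders).
    airflow run my_dag my_task 2017-04-05 --raw

    # If the DagBag times out while parsing, allow more time (in seconds)
    # for DAG imports in airflow.cfg:
    #   [core]
    #   dagbag_import_timeout = 120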

Hope this helps!

Michael Spector answered Nov 17 '22


I'm experiencing the same thing: the worker process appears to be passed a -sd argument corresponding to the dags folder on the scheduler machine, not on the worker machine (even if dags_folder is set correctly in the Airflow config file on the worker). In my case I was able to get things working by creating a symlink on the scheduler host so that dags_folder can be set to the same value on both machines. (In your example, this would mean creating a symlink /home/hadoop -> /home/ubuntu on the scheduler machine and then setting dags_folder to /home/hadoop; see the sketch below.) So this is not really an answer to the problem, but it is a viable workaround in some cases.
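
A minimal sketch of that workaround, assuming the scheduler's home is /home/ubuntu and the worker's is /home/hadoop (adjust to your layout):

    # On the scheduler machine: make /home/hadoop resolve to the scheduler's
    # real home, so one absolute dags path is valid on both machines.
    sudo ln -s /home/ubuntu /home/hadoop

    # Then, in airflow.cfg on both machines, point dags_folder at that path:
    #   [core]
    #   dags_folder = /home/hadoop/airflow/dags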

gcbenison answered Nov 17 '22


Have you tried setting the dags_folder parameter in the config file to point explicitly to /home/hadoop/, i.e. the desired path?

This parameter controls where Airflow looks for DAGs.
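
For instance, a quick way to check what the worker is actually configured with (this assumes the default layout, with the config at ~/airflow/airflow.cfg):

    # Show the dags_folder the worker will use.
    grep dags_folder ~/airflow/airflow.cfg

    # On the worker it should point at the worker's own dags directory, e.g.:
    #   dags_folder = /home/hadoop/airflow/dags
    # Edit airflow.cfg if it points elsewhere, then restart the worker.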

Priyank Mehta answered Nov 17 '22