Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

DAG(directed acyclic graph) dynamic job scheduler

I need to manage a large workflow of ETL tasks, which execution depends on time, data availability or an external event. Some jobs may fail during execution of the workflow and the system should have the ability to restart a failed workflow branch without waiting for whole workflow to finish execution.

Are there any frameworks in python that can handle this?

I see several core functions:

  • DAG bulding
  • Execution of nodes (run shell cmd with wait,logging etc.)
  • Ability to rebuild sub-graph in parent DAG during execution
  • Ability to manual execute nodes or sub-graph while parent graph is running
  • Suspend graph execution while waiting for external event
  • List job queue and job details

Something like Oozie, but more general purpose and in python.

like image 855
Alexandr Mazanov Avatar asked Jan 12 '13 10:01

Alexandr Mazanov


People also ask

What is a DAG job?

The directed acyclic graph (DAG) is the central management node of each distributed job, that is, the application master (AM.) The AM is often called a DAG because it coordinates distributed job running. Job running in modern distributed systems can be described using DAG scheduling and data shuffling[1].

What is a DAG good for?

DAGs are useful for representing many different types of flows, including data processing flows.

What is DAG directed acyclic graph give an example?

Directed Acyclic Graphs (DAGs) A directed acyclic graph (or DAG) is a digraph that has no cycles. Example of a DAG: Theorem Every finite DAG has at least one source, and at least one sink. In fact, given any vertex v, there is a path from some source to v, and a path from v to some sink.

What is the use of DAG in Airflow?

In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAGs structure (tasks and their dependencies) as code.


3 Answers

1) You can give dagobah a try, as described on its github page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using Cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight scheduler project comparing with the three followings.

dagobah's web interface

2) In terms of ETL tasks, luigi which is open sourced by Spotify focus more on hadoop jobs, as described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.

luigi's web interface

Both of the two modules are mainly written in Python and web interfaces are included for convenient management.

As far as I know, 'luigi' doesn't provide a scheduler module for job tasks, which I think is necessary for ETL tasks. But using 'luigi' is more easy to write map-reduce code in Python and thousands of tasks every day at Spotify run depend on it.

3) Like luigi, Pinterest open sourced their a workflow manager named Pinball. Pinball’s architecture follows a master-worker (or master-client to avoid naming confusion with a special type of client that we introduce below) paradigm where the stateful central master acts as a source of truth about the current system state to stateless clients. And it integrate hadoop/hive/spark jobs smoothly.

pinball's web interface

4) Airflow, yet another dag job schedule project open sourced by Airbnb, is quite like Luigi and Pinball. The backend is build on Flask, Celery and so on. According to the example job code, Airflow is both powerful and easy to use by my side.

airflow's web interface

Last but not least, Luigi, Airflow and Pinball may be more widely used. And there is a great comparison among these three: http://bytepawn.com/luigi-airflow-pinball.html

like image 190
zfz Avatar answered Oct 01 '22 04:10

zfz


There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems. Which includes systems that originate in both industry and academia.

like image 29
Dunk Avatar answered Oct 01 '22 05:10

Dunk


Have you looked at Ruffus?

I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.

like image 26
snth Avatar answered Oct 01 '22 03:10

snth