I need to manage a large workflow of ETL tasks whose execution depends on time, data availability, or an external event. Some jobs may fail during execution of the workflow, and the system should be able to restart a failed workflow branch without waiting for the whole workflow to finish.
Are there any frameworks in Python that can handle this?
I see several core functions such a system should provide.
Something like Oozie, but more general purpose and in Python.
In modern distributed systems, the execution of a job is commonly described as a directed acyclic graph (DAG): the application master (AM), the central management node of each distributed job, schedules the job's stages according to this DAG and coordinates the data shuffling between them[1].
DAGs are useful for representing many different types of flows, including data processing flows.
Directed Acyclic Graphs (DAGs): a directed acyclic graph (or DAG) is a digraph that has no cycles. Theorem: every finite DAG has at least one source and at least one sink. In fact, given any vertex v, there is a path from some source to v, and a path from v to some sink.
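As a quick illustration (a toy sketch; the task names and the `deps` mapping are made up), a data-processing DAG can be written as a mapping from each task to its prerequisites, and its sources and sinks found directly:

```python
# Toy DAG: each key is a task, each value is the list of tasks it depends on.
deps = {
    "extract": [],                  # depends on nothing -> a source
    "transform": ["extract"],
    "load": ["transform"],          # nothing depends on it -> a sink
}

sources = [t for t, pre in deps.items() if not pre]
sinks = [t for t in deps if not any(t in pre for pre in deps.values())]

print(sources)  # ['extract']
print(sinks)    # ['load']
```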
In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
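For example, a minimal Airflow DAG might look like the sketch below (the DAG id, schedule and bash commands are placeholders; the `retries` setting is one way Airflow re-runs a failed task without re-running the whole workflow):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "etl",
    "retries": 2,                       # retry a failed task instead of failing the whole run
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_etl",
    default_args=default_args,
    start_date=datetime(2016, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> transform >> load            # dependencies, i.e. the edges of the DAG
```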
1) You can give dagobah a try, as described on its GitHub page: Dagobah is a simple dependency-based job scheduler written in Python. Dagobah allows you to schedule periodic jobs using cron syntax. Each job then kicks off a series of tasks (subprocesses) in an order defined by a dependency graph you can easily draw with click-and-drag in the web interface. This is the most lightweight of these scheduler projects compared with the three that follow.
2) For ETL tasks, Luigi, which was open-sourced by Spotify, focuses more on Hadoop jobs, as described: Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, etc. It also comes with Hadoop support built in.
Both modules are written mainly in Python and include web interfaces for convenient management.
As far as I know, Luigi doesn't provide a scheduler module for triggering jobs, which I think is necessary for ETL tasks. But Luigi makes it easier to write map-reduce code in Python, and thousands of tasks run on it every day at Spotify.
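For what it's worth, here is a minimal Luigi sketch (the task names, parameter and file paths are made up) showing the dependency resolution; since a task whose output target already exists is not re-run, a failed branch can be restarted without redoing the finished work:

```python
import luigi


class Extract(luigi.Task):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget("raw_{}.csv".format(self.date))

    def run(self):
        with self.output().open("w") as f:
            f.write("raw data\n")


class Transform(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return Extract(date=self.date)   # dependency: Extract must run first

    def output(self):
        return luigi.LocalTarget("clean_{}.csv".format(self.date))

    def run(self):
        with self.input().open() as fin, self.output().open("w") as fout:
            fout.write(fin.read().upper())


if __name__ == "__main__":
    # e.g. python pipeline.py Transform --date 2016-01-01 --local-scheduler
    luigi.run()
```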
3) Like Luigi, Pinterest open-sourced a workflow manager named Pinball. Pinball's architecture follows a master-worker paradigm (or master-client, to avoid naming confusion with a special type of client introduced below), where the stateful central master acts as the source of truth about the current system state for stateless clients. It also integrates Hadoop/Hive/Spark jobs smoothly.
4) Airflow, yet another DAG job scheduling project, open-sourced by Airbnb, is quite like Luigi and Pinball. The backend is built on Flask, Celery, and so on. Judging from the example job code, Airflow is both powerful and easy to use, in my experience.
Last but not least, Luigi, Airflow and Pinball may be the most widely used. There is a good comparison of the three here: http://bytepawn.com/luigi-airflow-pinball.html
There are a ton of these; everyone seems to write their own. There is a good list at https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems, which includes systems originating in both industry and academia.
Have you looked at Ruffus?
I have no experience with it but it appears to do some of the items on your list. It also looks quite hackable so you might be able to implement your other requirements yourself.