I created a simple Oozie workflow with Sqoop, Hive and Pig actions. For each of these actions, Oozie launches an MR launcher job, which in turn launches the action (Sqoop/Hive/Pig). So there are a total of 6 MR jobs for the 3 actions in the workflow.
Why does Oozie start an MR launcher to start the action and not directly start the action?
Oozie is a server based Workflow Engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs. Oozie is a Java Web-Application that runs in a Java servlet-container.
The Oozie Coordinator system allows the user to define and execute recurrent and interdependent workflow jobs (data application pipelines). Real world data application pipelines have to account for reprocessing, late processing, catchup, partial processing, monitoring, notification and SLAs.
Apache Oozie is a tool for Hadoop operations that allows cluster administrators to build complex data transformations out of multiple component tasks. This provides greater control over jobs and also makes it easier to repeat those jobs at predetermined intervals.
A workflow in Oozie is a sequence of actions arranged in a control dependency DAG (Directed Acyclic Graph). "Control dependency" means that an action cannot start until the action before it has completed, so each action depends on its predecessor.
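To make the control-dependency idea concrete, here is a minimal Python sketch. It is not Oozie's API, only an illustration of ordering actions in a DAG. The action names and the linear sqoop, hive, pig ordering are assumptions for the example; `graphlib` is part of the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop-import": set(),
    "hive-transform": {"sqoop-import"},
    "pig-report": {"hive-transform"},
}

def execution_order(dag):
    """Return one valid order in which the actions may run."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

Because each action here depends only on the one before it, there is exactly one valid order. A workflow with independent branches would have several.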
I posted the same question in the Apache Flume forums, and here is the response.
It's also to keep the Oozie server from being bogged down or becoming unstable. For example, if you have a bunch of workflows running Pig jobs, then you'd have the Oozie server running multiple copies of the Pig client (which is a relatively "heavy" program) directly. By moving all of the user code and external clients to map tasks in the launcher job, the Oozie server remains more lightweight and less prone to errors. It is also much more scalable this way, because the launcher jobs distribute the job launching/monitoring to other machines in the cluster; otherwise, with the Oozie server doing everything, we'd have to limit the number of concurrent workflows based on the Oozie server's machine specs (RAM, CPU, etc). And finally, from an architectural standpoint, the Oozie server itself is stateless; that is, everything is stored in the database and the Oozie server can be taken down at any point without losing anything. If we were to launch jobs directly from the Oozie server, then the server would hold some state (e.g. a running Pig client, which cannot be restarted and resumed).
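The launcher pattern described in that response can be sketched as a toy model in Python. Everything here is hypothetical (the `Cluster` class, the function names, the job id format) and it does not use any Oozie or Hadoop API. It also assumes, as in the question, that each action results in exactly one job besides its launcher. The point is only that the server submits a launcher and records an id, while the client runs elsewhere.

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class Cluster:
    """Stand-in for the Hadoop cluster; it only records submitted jobs."""
    jobs: list = field(default_factory=list)
    _ids: count = field(default_factory=lambda: count(1))

    def submit(self, name):
        job_id = f"job_{next(self._ids):04d}"
        self.jobs.append((job_id, name))
        return job_id

def run_launcher(cluster, action):
    """Models the launcher's map task: it starts the real client job."""
    return cluster.submit(f"{action}-job")

def server_start_action(cluster, db, action):
    """Models the server: submit a launcher and persist its id, nothing more."""
    launcher_id = cluster.submit(f"{action}-launcher")
    db[action] = launcher_id       # all state goes to the database
    run_launcher(cluster, action)  # in reality this runs on a cluster node

cluster, db = Cluster(), {}
for action in ("sqoop", "hive", "pig"):
    server_start_action(cluster, db, action)

print(len(cluster.jobs))  # 6: one launcher plus one job per action
print(db)
```

In this model the server process keeps nothing in memory between calls. If it were restarted, the launcher ids in `db` would be enough to look the jobs up again, which is the stateless property the response describes.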