Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is significance of the Oozie MR launcher?

I created a simple Oozie work flow with Sqoop, Hive and Pig actions. For each of there actions, Oozie launches a MR launcher and which in turn launches the action (Sqoop/Hive/Pig). So, there are a total of 6 MR jobs for 3 actions in the work flow.

Why does Oozie start an MR launcher to start the action and not directly start the action?

like image 684
Praveen Sripati Avatar asked Oct 21 '13 07:10

Praveen Sripati


People also ask

What is Oozie launcher?

Oozie is a server based Workflow Engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs. Oozie is a Java Web-Application that runs in a Java servlet-container.

What is the purpose of the Oozie coordinator?

The Oozie Coordinator system allows the user to define and execute recurrent and interdependent workflow jobs (data application pipelines). Real world data application pipelines have to account for reprocessing, late processing, catchup, partial processing, monitoring, notification and SLAs.

What does Oozie do in Hadoop?

Apache Oozie is a tool for Hadoop operations that allows cluster administrators to build complex data transformations out of multiple component tasks. This provides greater control over jobs and also makes it easier to repeat those jobs at predetermined intervals.

How are Oozie workflows defined?

Workflow in Oozie is a sequence of actions arranged in a control dependency DAG (Direct Acyclic Graph). The actions are in controlled dependency as the next action can only run as per the output of current action. Subsequent actions are dependent on its previous action.


1 Answers

I posted the same in Apache Flume forums and here is the response.

It's also to keep the Oozie server from being bogged down or becoming unstable. For example, if you have a bunch of workflows running Pig jobs, then you'd have the Oozie server running multiple copies of the Pig client (which is a relatively "heavy" program) directly. By moving all of the user code and external clients to map tasks in the launcher job, the Oozie server remains more light-weight and less prone to errors. It can also much more scalable this way because the launcher jobs distribute the the job launching/monitoring to other machines in the cluster; otherwise, with the Oozie server doing everything, we'd have to limit the number of concurrent workflows based on your Oozie server's machine specs (RAM, CPU, etc). And finally, from an architectural standpoint, the Oozie server itself is stateless; that is, everything is stored in the database and the Oozie server can be taken down at any point without losing anything. If we were to launch jobs directly from the Oozie server, then we'd now have some state (e.g. the Pig client cannot be restarted and resumed).

like image 165
Praveen Sripati Avatar answered Oct 16 '22 07:10

Praveen Sripati