How do WordCount MapReduce jobs run on a Hadoop YARN cluster with Apache Tez?

As the GitHub page of Tez says, Tez is very simple and at its heart has just two components:

  1. The data-processing pipeline engine, and

  2. A master for the data-processing application, whereby one can put together arbitrary data-processing 'tasks' described above into a task-DAG

My first question is: how are existing MapReduce jobs like the WordCount in tez-examples.jar converted to a task-DAG? Where does that happen? Or are they not converted at all?

And my second, more important question is about this part:

Every 'task' in tez has the following:

  1. Input to consume key/value pairs from.
  2. Processor to process them.
  3. Output to collect the processed key/value pairs.
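The Input/Processor/Output triad quoted above can be illustrated with a small, purely conceptual sketch. These interfaces are hypothetical stand-ins, not the real Tez API; the Processor here performs the WordCount summation step:

```java
import java.util.*;
import java.util.stream.*;

// Conceptual sketch, not the real Tez API: a task consumes key/value
// pairs from an Input, transforms them with a Processor, and hands the
// result to an Output. Here the Processor is WordCount's summation.
public class TaskTriad {
    // Processor logic: sum the counts for each key, sorted by key.
    static List<Map.Entry<String, Integer>> sumByKey(List<Map.Entry<String, Integer>> in) {
        return in.stream()
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        TreeMap::new, Collectors.summingInt(Map.Entry::getValue)))
                .entrySet().stream().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Input: key/value pairs consumed from upstream (here, a fixed list).
        List<Map.Entry<String, Integer>> input = List.of(
                Map.entry("tez", 1), Map.entry("yarn", 1), Map.entry("tez", 1));
        // Output: collect the processed pairs (here, just print them).
        for (Map.Entry<String, Integer> e : sumByKey(input)) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```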

Who is in charge of splitting the input data between the Tez tasks? Is it code that the user provides, is it YARN (the resource manager), or is it Tez itself?

The question is the same for the output phase. Thanks in advance.

AmirSojoodi asked Jul 17 '15




1 Answer

To answer your first question on converting MapReduce jobs to Tez DAGs:

Any MapReduce job can be thought of as a single DAG with 2 vertices (stages). The first vertex is the Map phase, and it is connected to a downstream Reduce vertex via a Shuffle edge.

There are 2 ways in which MR jobs can be run on Tez:

  1. One approach is to write a native 2-stage DAG using the Tez APIs directly. This is what is currently present in tez-examples.
  2. The second is to use the MapReduce APIs themselves and run in yarn-tez mode. In this scenario, a layer intercepts the MR job submission and, instead of using MR, translates the MR job into a 2-stage Tez DAG and executes the DAG on the Tez runtime.
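For the second approach, the switch is a single configuration property. Assuming Tez is already installed on the cluster, setting `mapreduce.framework.name` in `mapred-site.xml` enables the interception layer:

```xml
<!-- mapred-site.xml: run existing, unmodified MapReduce jobs on the Tez
     runtime. MR job submissions are intercepted and translated into
     2-stage Tez DAGs (assumes Tez libraries are deployed on the cluster). -->
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn-tez</value>
</property>
```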

For your data-handling questions:

The user provides the logic for understanding the data to be read and how to split it. Tez then takes over the responsibility of assigning a split, or a set of splits, to a given task.
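The division of labor described above can be sketched as follows. This is a toy illustration, not Tez code: the "user-provided" method decides how an input of a given size breaks into byte ranges, and the "framework-side" method assigns those splits to a fixed pool of tasks (round-robin here, purely for illustration):

```java
import java.util.*;

// Conceptual sketch of the division of labor: user logic computes the
// splits; the framework assigns splits to tasks.
public class SplitAssignment {
    // User-provided: split totalBytes into {offset, length} ranges.
    static List<long[]> computeSplits(long totalBytes, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        for (long off = 0; off < totalBytes; off += splitSize) {
            splits.add(new long[]{off, Math.min(splitSize, totalBytes - off)});
        }
        return splits;
    }

    // Framework-side: hand each split to one of numTasks tasks (round-robin).
    static Map<Integer, List<long[]>> assign(List<long[]> splits, int numTasks) {
        Map<Integer, List<long[]>> byTask = new TreeMap<>();
        for (int i = 0; i < splits.size(); i++) {
            byTask.computeIfAbsent(i % numTasks, k -> new ArrayList<>()).add(splits.get(i));
        }
        return byTask;
    }

    public static void main(String[] args) {
        // 1000 bytes with a 300-byte split size yields splits of 300, 300, 300, 100.
        List<long[]> splits = computeSplits(1000, 300);
        System.out.println(splits.size() + " splits, task 0 gets "
                + assign(splits, 2).get(0).size());
    }
}
```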

The Tez framework then controls the generation and movement of data, i.e. where intermediate data is generated and how it moves between 2 vertices/stages. However, it does not control the underlying data contents/structure, partitioning, or serialization logic, which is provided by user plugins.
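To make the "user plugin" part concrete: the routing of a key to a downstream task is one such pluggable piece. The sketch below mirrors the default hash-partitioning scheme familiar from MapReduce's HashPartitioner; the class name here is illustrative, not a Tez class:

```java
// Conceptual sketch: the framework moves the data between vertices, but
// *which* downstream task each key is routed to comes from user-supplied
// partitioning logic, e.g. hash partitioning as in MapReduce's default.
public class KeyPartitioner {
    // Map a key to one of numPartitions downstream tasks; masking with
    // Integer.MAX_VALUE keeps the result non-negative for any hashCode.
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (String word : new String[]{"tez", "yarn", "dag"}) {
            System.out.println(word + " -> task " + getPartition(word, 3));
        }
    }
}
```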

The above is just a high-level view; there are additional intricacies. You will get more detailed answers by posting specific questions to the development list ( http://tez.apache.org/mail-lists.html )

hadoop_user answered Oct 26 '22