As the GitHub page of Tez says, Tez is very simple and at its heart has just two components:
The data-processing pipeline engine, and
A master for the data-processing application, where-by one can put together arbitrary data-processing 'tasks' described above into a task-DAG
My first question: how are existing MapReduce jobs, such as the wordcount in tez-examples.jar, converted to a task-DAG? Where does that happen? Or are they not converted at all?
My second and more important question is about this part:
Every 'task' in tez has the following:
Who is in charge of splitting input data between the Tez tasks? Is it the code that the user provides, YARN (the resource manager), or Tez itself?
The same question applies to the output phase. Thanks in advance.
To answer your first question on converting MapReduce jobs to Tez DAGs:
Any MapReduce job can be thought of as a single DAG with 2 vertices (stages). The first vertex is the Map phase, and it is connected to a downstream Reduce vertex via a Shuffle edge.
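To make the two-vertex picture concrete, here is a minimal sketch of word count as a Map stage, a Shuffle edge, and a Reduce stage. It is a plain-Python toy model, not the Tez API; the names map_vertex, shuffle_edge, reduce_vertex and run_dag are made up for illustration.

```python
from collections import defaultdict

# Toy model only: these functions are hypothetical and do not use Tez.

def map_vertex(lines):
    """Map stage: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle_edge(pairs):
    """Shuffle edge: group values by key and pass them downstream sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_vertex(grouped):
    """Reduce stage: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped}

def run_dag(lines):
    """Wire the two vertices together through the shuffle edge."""
    return reduce_vertex(shuffle_edge(map_vertex(lines)))

counts = run_dag(["tez runs dags", "mapreduce runs on tez"])
print(counts)
```

In a real engine each stage would run as many parallel tasks on different machines; the sketch only shows how the stages are connected.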
There are 2 ways in which MR jobs can be run on Tez:
For the data handling related questions that you have:
The user provides the logic for understanding the data to be read and for splitting it. Tez then takes over the responsibility of assigning each split, or a set of splits, to a given task.
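The division of labour described here can be sketched roughly as follows. This is a toy model in plain Python, not Tez code: compute_splits stands in for the user-provided split logic, and assign_splits stands in for the framework. The round-robin policy is purely an assumption for illustration and is not how Tez actually decides.

```python
# Toy model only: function names and the round-robin policy are assumptions.

def compute_splits(total_bytes, split_size):
    """User side: describe the input as a list of (offset, length) splits."""
    return [
        (offset, min(split_size, total_bytes - offset))
        for offset in range(0, total_bytes, split_size)
    ]

def assign_splits(splits, num_tasks):
    """Framework side: hand each split to a task (round-robin here)."""
    assignment = [[] for _ in range(num_tasks)]
    for index, split in enumerate(splits):
        assignment[index % num_tasks].append(split)
    return assignment

splits = compute_splits(total_bytes=250, split_size=100)
assignment = assign_splits(splits, num_tasks=2)
print(splits)      # [(0, 100), (100, 100), (200, 50)]
print(assignment)  # [[(0, 100), (200, 50)], [(100, 100)]]
```

Note that a task may receive more than one split, which matches the "a split or a set of splits" wording above.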
The Tez framework then controls the generation and movement of data, i.e. where to generate the data between intermediate steps and how to move data between 2 vertices/stages. However, it does not control the underlying data contents/structure, partitioning, or serialization logic, which are provided by user plugins.
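As a rough illustration of that boundary, the sketch below separates "plugin" functions (partitioning and serialization, supplied by the user) from a "framework" function that only routes opaque bytes. Everything here is a hypothetical plain-Python model; none of the names correspond to real Tez interfaces, and the CRC-based partitioner and JSON encoding are arbitrary choices.

```python
import json
import zlib

# Toy model only: hypothetical names, not Tez interfaces.

def partition(key, num_partitions):
    """User plugin: pick the downstream partition using a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def serialize(key, value):
    """User plugin: turn a record into bytes."""
    return json.dumps([key, value]).encode("utf-8")

def deserialize(blob):
    """User plugin: turn bytes back into a (key, value) record."""
    key, value = json.loads(blob.decode("utf-8"))
    return key, value

def move_data(records, num_partitions):
    """Framework side: route serialized records without inspecting their contents."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition(key, num_partitions)].append(serialize(key, value))
    return buckets

buckets = move_data([("tez", 1), ("yarn", 1), ("tez", 1)], num_partitions=2)
for index, bucket in enumerate(buckets):
    print(index, [deserialize(blob) for blob in bucket])
```

Swapping in a different partition or serialize function changes where records go and how they are encoded, while move_data stays the same.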
The above is just a high-level view; there are additional intricacies. You will get more detailed answers by posting specific questions to the Development list ( http://tez.apache.org/mail-lists.html )