I was trying to see what makes Apache Tez with Hive much faster than map reduce with hive. 
I am not able to understand DAG concept.
Anyone have a good reference for understanding the architecture of Apache TEZ.
Apache Tez is a framework for creating a complex directed acyclic graph (DAG) of tasks for processing data. In some cases, it is used as an alternative to Hadoop MapReduce. For example, Pig and Hive workflows can run using Hadoop MapReduce or they can use Tez as an execution engine.
Though, Spark can't run concurrently with other YARN applications (at least not yet). Gopal V, one of the developers for the Tez project, wrote an extensive post about why he likes Tez.
Architecture of Hive Hive is a data warehouse infrastructure software that can create interaction between user and HDFS. The user interfaces that Hive supports are Hive Web UI, Hive command line, and Hive HD Insight (In Windows server).
Apache Tez completed the whole script as a one job while MapReduce divides it into two jobs. Due to two jobs MapReduce takes more time than Apache Tez. Hadoop memory containers firstly allocated for one job and second job got chance only after completion of the first job.
The presentation from Hadoop Summit (slide 35) discussed how the DAG approach is optimal vs MapReduce paradigm:
http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212
Essentially it will allow higher level tools (like Hive and Pig) to define their overall processing steps (aka workflow, aka Directed Acyclical Graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (hive query, Pig job, etc.). Because the entire job's steps can be computed before execution time, the system can take advantage of caching intermediate job results "in memory". Whereas, in MapReduce all intermediate data between MapReduce phases required writing to HDFS (disk) adding latency.
YARN also allows container reuse for Tez tasks. E.g. each server is chopped into multiple "containers" rather than "map" or "reduce" slots. For any given point in the job execution this allows Tez to use the entire cluster for the map phases or the reduce phases as needed. Whereas in Hadoop v1 prior to YARN, the number of map slots (and reduce slots) were fixed/hard coded at the platform level. Better utilization of all available cluster resources generally leads to faster
Apache Tez represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.
Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic in an efficient manner and then execute it with high performance which is managed by Tez. Tez achieves this goal by modeling data processing not as a single job, but rather as a data flow graph.
… with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.
Data Processing API in Apache Tez blog post describes a simple Java API used to express a DAG of data processing. The API has three components
•DAG. this defines the overall job. The user creates a DAG object for each data processing job.
•Vertex. this defines the user logic and the resources & environment needed to execute the user logic. The user creates a Vertex object for each step in the job and adds it to the DAG.
•Edge. this defines the connection between producer and consumer vertices. The user creates an Edge object and connects the producer and consumer vertices using it.
Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately and define how to route data between the tasks. Tez also allows to define parallelism for each vertex execution by specifying user guidance, data size and resources.
Data movement: Defines routing of data between tasks ◦One-To-One: Data from the ith producer task routes to the ith consumer task.
Broadcast: Data from a producer task routes to all consumer tasks.
Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.
Scheduling. Defines when a consumer task is scheduled ◦Sequential: Consumer task may be scheduled after a producer task completes. Concurrent: Consumer task must be co-scheduled with a producer task.
Data source: Defines the lifetime/reliability of a task output ◦Persisted: Output will be available after the task exits. Output may be lost later on. Persisted-Reliable: Output is reliably stored and will always be available Ephemeral: Output is available only while the producer task is running.
Additional details on Tez architecture are presented in this Apache Tez Design Doc.
I am not yet using Tez but I have read about it. I think the main two reasons that will make Hive to run faster over Tez are:
You can find a list of links that will help you to understand Tez better here: http://hortonworks.com/hadoop/tez/
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With