Apache Tez architecture explanation

Tags: hadoop, hive

I am trying to understand what makes Hive on Apache Tez so much faster than Hive on MapReduce, but I do not understand the DAG concept.
Does anyone have a good reference for understanding the architecture of Apache Tez?

asked Aug 27 '14 07:08

hjamali52

3 Answers

This presentation from Hadoop Summit (slide 35) discusses why the DAG approach is a better fit than the MapReduce paradigm:

http://www.slideshare.net/Hadoop_Summit/murhty-saha-june26255pmroom212

Essentially, it allows higher-level tools (like Hive and Pig) to define their overall processing steps (aka workflow, aka Directed Acyclic Graph) before the job begins. A DAG is a graph of all the steps needed to complete the job (Hive query, Pig job, etc.). Because all of the job's steps are known before execution starts, the system can keep intermediate results "in memory" where possible instead of persisting them. In MapReduce, by contrast, a query that needs several chained MapReduce jobs has to write the output of each job to HDFS (disk) before the next job can read it, which adds latency.
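To make the DAG idea concrete, here is a minimal Python sketch. It is not Tez code: the step names and the `plan` dictionary are made up for illustration, and the only point is that once every step and dependency is declared up front, a valid execution order can be worked out before anything runs.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical steps for a query that joins two tables and then aggregates.
# Each step maps to the set of steps whose output it needs.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "group_by": {"join"},
    "order_by": {"group_by"},
}

def execution_order(dag):
    """Return one valid order in which the steps can run."""
    return list(TopologicalSorter(dag).static_order())

def intermediate_edges(dag):
    """Count the producer -> consumer hand-offs (edges) in the plan."""
    return sum(len(deps) for deps in dag.values())

order = execution_order(plan)
print(order)                     # both scans come before "join"
print(intermediate_edges(plan))  # 4
```

Each of the four edges is a hand-off of intermediate data. With chained MapReduce jobs, a hand-off between jobs goes through HDFS; an engine that sees the whole graph can choose a cheaper route for it.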

YARN also allows container reuse for Tez tasks. Each server is divided into multiple generic "containers" rather than dedicated "map" or "reduce" slots, so at any point in the job Tez can use the entire cluster for the map phases or the reduce phases as needed. In Hadoop v1, prior to YARN, the number of map slots (and reduce slots) was fixed/hard-coded at the platform level. Better utilization of all available cluster resources generally leads to faster job completion.
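A back-of-the-envelope way to see the slot-versus-container difference is to count how many "rounds" of tasks a job needs. The sketch below is a simplified model, not a scheduler: it assumes every task takes the same time and that the reduce phase starts only after the map phase ends, and the cluster sizes are hypothetical.

```python
import math

def waves(tasks, workers):
    """Number of sequential rounds needed to run `tasks` on `workers`."""
    return math.ceil(tasks / workers) if tasks else 0

def fixed_slot_rounds(map_tasks, reduce_tasks, map_slots, reduce_slots):
    # Hadoop v1 style: map tasks can only use map slots,
    # reduce tasks can only use reduce slots.
    return waves(map_tasks, map_slots) + waves(reduce_tasks, reduce_slots)

def container_rounds(map_tasks, reduce_tasks, containers):
    # YARN style: any container can run whichever phase is active.
    return waves(map_tasks, containers) + waves(reduce_tasks, containers)

# Hypothetical cluster with capacity for 10 concurrent tasks,
# split 6 map / 4 reduce in the fixed-slot case.
print(fixed_slot_rounds(60, 8, map_slots=6, reduce_slots=4))  # 12
print(container_rounds(60, 8, containers=10))                 # 7
```

In the fixed-slot case the 4 reduce slots sit idle during the whole map phase, which is where the extra rounds come from.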

answered Sep 23 '22 14:09

Wes Floyd

Apache Tez represents an alternative to the traditional MapReduce that allows for jobs to meet demands for fast response times and extreme throughput at petabyte scale.

Higher-level data processing applications like Hive and Pig need an execution framework that can express their complex query logic efficiently and then execute it with high performance. Tez provides this by modeling data processing not as a single job, but rather as a data flow graph.

… with vertices in the graph representing application logic and edges representing movement of data. A rich dataflow definition API allows users to express complex query logic in an intuitive manner and it is a natural fit for query plans produced by higher-level declarative applications like Hive and Pig... [The] dataflow pipeline can be expressed as a single Tez job that will run the entire computation. Expanding this logical graph into a physical graph of tasks and executing it is taken care of by Tez.

The "Data Processing API in Apache Tez" blog post describes a simple Java API used to express a DAG of data processing. The API has three components:

• DAG: defines the overall job. The user creates a DAG object for each data processing job.

• Vertex: defines the user logic and the resources and environment needed to execute it. The user creates a Vertex object for each step in the job and adds it to the DAG.

• Edge: defines the connection between producer and consumer vertices. The user creates an Edge object and uses it to connect the producer and consumer vertices.
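The three components above can be mirrored in a few lines of Python. This is only a toy model to show how the pieces relate; it is not the Tez Java API, and the class fields, method names, and processor names (`TokenProcessor`, `SumProcessor`) are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Vertex:
    name: str
    processor: str    # name of the user logic to run (hypothetical)
    parallelism: int  # number of tasks for this step

@dataclass(frozen=True)
class Edge:
    producer: Vertex
    consumer: Vertex

@dataclass
class DAG:
    name: str
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices.append(vertex)
        return self  # returning self allows chained calls

    def add_edge(self, edge):
        # Both ends must already belong to this DAG.
        if edge.producer not in self.vertices or edge.consumer not in self.vertices:
            raise ValueError("add both vertices before connecting them")
        self.edges.append(edge)
        return self

    def consumers_of(self, vertex):
        return [e.consumer for e in self.edges if e.producer == vertex]

# One DAG per job, one Vertex per step, one Edge per producer/consumer link.
tokenizer = Vertex("tokenizer", "TokenProcessor", parallelism=4)
summer = Vertex("summer", "SumProcessor", parallelism=2)
dag = DAG("word_count").add_vertex(tokenizer).add_vertex(summer)
dag.add_edge(Edge(tokenizer, summer))
print([v.name for v in dag.consumers_of(tokenizer)])  # ['summer']
```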

Edge properties defined by Tez enable it to instantiate user tasks, configure their inputs and outputs, schedule them appropriately, and define how to route data between the tasks. Tez also lets you define the parallelism of each vertex based on user guidance, data size, and resources.

• Data movement: defines the routing of data between tasks.
  ◦ One-To-One: Data from the ith producer task routes to the ith consumer task.
  ◦ Broadcast: Data from a producer task routes to all consumer tasks.
  ◦ Scatter-Gather: Producer tasks scatter data into shards and consumer tasks gather the shards. The ith shard from all producer tasks routes to the ith consumer task.

• Scheduling: defines when a consumer task is scheduled.
  ◦ Sequential: Consumer task may be scheduled after a producer task completes.
  ◦ Concurrent: Consumer task must be co-scheduled with a producer task.

• Data source: defines the lifetime/reliability of a task output.
  ◦ Persisted: Output will be available after the task exits. Output may be lost later on.
  ◦ Persisted-Reliable: Output is reliably stored and will always be available.
  ◦ Ephemeral: Output is available only while the producer task is running.
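The three data movement types are easiest to compare side by side. The following Python sketch is illustrative only (the function names and the modulo partitioning rule are assumptions, not how Tez implements routing); it shows which consumer tasks receive data from a given producer task.

```python
def one_to_one(producer_index, num_consumers):
    """Producer task i feeds only consumer task i."""
    return [producer_index]

def broadcast(producer_index, num_consumers):
    """Every producer task feeds all consumer tasks."""
    return list(range(num_consumers))

def scatter(records, num_consumers):
    """Split one producer's records into one shard per consumer.

    Integer records are partitioned by value modulo the consumer
    count; a real engine would partition on a key of the record.
    """
    shards = [[] for _ in range(num_consumers)]
    for record in records:
        shards[record % num_consumers].append(record)
    return shards

def gather(all_shards, consumer_index):
    """Consumer i collects the ith shard from every producer."""
    collected = []
    for shards in all_shards:
        collected.extend(shards[consumer_index])
    return collected

# Two producer tasks, two consumer tasks.
producer_outputs = [[1, 2, 3, 4], [5, 6, 7]]
all_shards = [scatter(out, 2) for out in producer_outputs]

print(one_to_one(1, 2))       # [1]
print(broadcast(1, 2))        # [0, 1]
print(gather(all_shards, 0))  # [2, 4, 6]
print(gather(all_shards, 1))  # [1, 3, 5, 7]
```

Scatter-Gather is the pattern a classic MapReduce shuffle follows; One-To-One and Broadcast are the extra options a DAG engine can use for things like chained steps or shipping a small table to every task.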

Additional details on Tez architecture are presented in this Apache Tez Design Doc.

answered Sep 22 '22 14:09

Abhijeet Dhumal

I am not using Tez yet, but I have read about it. I think the two main reasons Hive runs faster on Tez are:

1. Tez shares data between MapReduce-style stages in memory when possible, avoiding the overhead of writing to and reading from HDFS.
2. With Tez you can run multiple map/reduce DAGs defined in Hive within one Tez session, without needing to start a new application master each time.

You can find a list of links that will help you understand Tez better here: http://hortonworks.com/hadoop/tez/

answered Sep 24 '22 14:09

Geeky
