
Can you explain when and why MapReduce is invoked in Hive?

Tags:

hive

hiveql

  1. select * from Table_name limit 5;

  2. select col1_name,col2_name from table_name limit 5;

When I run the first query, no MapReduce job is invoked, while for the second one MapReduce is invoked. Could you please explain the reason?

channabasava sajjan asked Jun 18 '15 06:06


3 Answers

Take the simple Hive statement below:

Describe table;

This reads data only from the Hive metastore and is the simplest and fastest query in Hive.

select * from table;

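This query only needs to read data from HDFS, so Hive can serve it with a plain fetch task (which queries qualify for this is controlled by the hive.fetch.task.conversion setting). So far, neither statement requires a map or a reduce phase.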

select * from table where color in ('RED','WHITE','BLUE')

This query requires a map phase only; there is no reduce phase because there is no aggregation function of any kind. Here we are just filtering to collect the records that are RED, WHITE, or BLUE.
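To make the map-only case concrete, here is a minimal Python sketch of what such a filter amounts to. It is only an illustration, not Hive's implementation: the two in-memory "splits" and the row layout are made-up stand-ins for a table stored in HDFS.

```python
# Hypothetical stand-in for a table stored as two HDFS splits.
splits = [
    [{"id": 1, "color": "RED"}, {"id": 2, "color": "GREEN"}],
    [{"id": 3, "color": "BLUE"}, {"id": 4, "color": "BLACK"}],
]

WANTED = {"RED", "WHITE", "BLUE"}


def map_filter(split):
    """Mapper: inspect one record at a time and keep it if it matches."""
    return [row for row in split if row["color"] in WANTED]


# Each split is handled by its own mapper and the outputs are simply
# concatenated. No record depends on any other record, so there is
# nothing to shuffle and nothing for a reducer to do.
result = [row for split in splits for row in map_filter(split)]
print(result)
```

Because every record can be accepted or rejected on its own, the work parallelizes across mappers without any data exchange between them.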

select count(1) from table;

This query requires both a map phase and a reduce phase (a MapReduce job cannot consist of a reduce phase alone): the mappers scan the table and emit partial counts, and a single reducer sums them into the final count. If we want to count per element instead, the map phase additionally has to emit a key for each record. See below:

Select color
, count(1) as color_count 
  from table  
  group by color;

This query has an aggregation function and a group by clause. We are counting the number of records in the table for each color (RED, WHITE, BLUE, and so on). This counting requires a map phase and a reduce phase.

Essentially we create key-value pairs in the above job. We map each record to a key, which in this case will be RED, WHITE, or BLUE, and attach a value of one. So the key:value pair is color:1. Then we sum the values for each color key. This is a map and reduce job.
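The color:1 idea can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not what Hive actually generates: the sample rows and the function names `map_phase`, `shuffle`, and `reduce_phase` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample of the color column.
rows = ["RED", "BLUE", "RED", "WHITE", "BLUE", "RED"]


def map_phase(colors):
    """Map: emit a (color, 1) pair for every record."""
    return [(color, 1) for color in colors]


def shuffle(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the ones for each color."""
    return {key: sum(values) for key, values in groups.items()}


color_count = reduce_phase(shuffle(map_phase(rows)))
print(color_count)
```

The shuffle step is the reason a reducer is needed at all: the ones for a given color may come from different mappers and must be brought together before they can be summed.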

Now take the same query and add an order by clause.

Select color
, count(1) as color_count 
  from table  
  group by color
  order by color_count desc;

This adds another reduce phase and forces the whole data set to pass through a single reducer. This is necessary because we want to ensure that global ordering is maintained. Count(distinct color) also forces a single reducer and requires a map and a reduce phase.
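Why global ordering needs a single reducer can be seen in a small Python sketch. The counts and the way keys are split across two reducers are made up for illustration; the only point is that sorting inside each reducer does not sort the combined output.

```python
def sort_desc(pairs):
    """Sort (color, count) pairs by count, highest first."""
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


# Hypothetical per-color counts, split across two reducers by key.
partition_a = [("RED", 3), ("WHITE", 1)]
partition_b = [("BLUE", 2), ("GREEN", 5)]

# Two reducers: each sorts only what it received, and the output
# files are concatenated. The result is sorted per reducer only.
two_reducers = sort_desc(partition_a) + sort_desc(partition_b)

# One reducer: it sees every pair, so the order is global.
single_reducer = sort_desc(partition_a + partition_b)

print(two_reducers)
print(single_reducer)
```

With two reducers GREEN (5) ends up after WHITE (1), which is exactly the situation that routing everything through one reducer avoids.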

As you add complexity to your Hive query, you similarly add to the map and reduce jobs required to obtain the requested results.

If you want to find out how Hive will manage a query, you can put the explain keyword in front of your query.

 Explain select * from table;

This can give you an idea of how the query is being executed under the hood. It shows the dependencies between stages, which aggregations (if any) result in reduce phases, and which operators run in map phases.

invoketheshell answered Oct 11 '22 08:10


To understand the reason, we first need to know what the map and reduce phases mean:

  1. Map: Basically a filter that selects the data and organizes it for the next phase. For example, in the second query it picks col1_name and col2_name out of each row. In the first query, however, you are reading every column, so no filtering is required. Hence there is no map phase.

  2. Reduce: Reduce is just a summary operation on data across rows, for example the sum of a column. Neither of the two queries needs any summary data. Hence there is no reducer.

So the first query has no MapReduce at all, and the second query has only mappers and no reducers.

Mangat Rai Modi answered Oct 11 '22 10:10


It's logical.

In the first query, the only thing to be done is to dump the data with a limit of 5 (which means take any 5 rows). No query-specific processing is needed, other than knowing how the rows are separated.

But in the second query a MapReduce job is needed. Why? Because it first has to process the data to know how many different columns there are, and then to check whether col1 and col2 really exist or whether there is only one column. If they exist, it has to eliminate the other columns first and then take only five rows from the remaining columns.

Ankit Agrahari answered Oct 11 '22 09:10