select * from Table_name limit 5;
select col1_name,col2_name from table_name limit 5;
When I run the first query, no MapReduce job is invoked, while for the second one MapReduce is invoked. Could you please explain the reason?
Take the simple hive query below:
Describe table;
This reads data from the Hive metastore and is the simplest and fastest query in Hive.
select * from table;
This query only needs to read data from HDFS. So far, neither query requires any map or reduce phases.
select * from table where color in ('RED','WHITE','BLUE')
This query requires a map phase only; there is no reduce phase, because there is no aggregation function of any kind. Here we are filtering to collect records that are RED, WHITE, or BLUE.
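To illustrate why a filter needs no reduce phase, here is a minimal Python sketch of a map-only job. It is not what Hive actually runs; the sample rows and the map_filter helper are made up for illustration.

```python
# Hypothetical rows standing in for the table; only the "color" column matters here.
rows = [
    {"id": 1, "color": "RED"},
    {"id": 2, "color": "GREEN"},
    {"id": 3, "color": "BLUE"},
    {"id": 4, "color": "WHITE"},
    {"id": 5, "color": "BLACK"},
]

WANTED = {"RED", "WHITE", "BLUE"}


def map_filter(row):
    """Mapper: emit the row if its color matches, otherwise emit nothing."""
    if row["color"] in WANTED:
        yield row


# Each row is handled on its own. Nothing has to be combined across rows,
# so the mapper output is already the final result and no reducer is needed.
result = [out for row in rows for out in map_filter(row)]
print([r["id"] for r in result])
```

Because every row is decided independently, the work can be split across any number of mappers without a later step to bring the pieces together.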
select count(1) from table;
This query requires both a map phase and a reduce phase. The mappers read the records (and typically pre-count their own share), and a reducer adds the partial counts into one total. If we want to count per element instead, the map phase also has to emit a key for each record. See below:
Select color
, count(1) as color_count
from table
group by color;
This query has an aggregation function and a group by clause. We are counting the number of records in the table for each color: RED, WHITE, or BLUE. This counting requires a map and a reduce phase.
Essentially we create key-value pairs in the above job. We map each record to a key; in this case the keys will be RED, WHITE, and BLUE. Each record is then given a value of one, so the key:value pair is color:1. Then we sum the values for each color key. This is a map and reduce job.
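The color:1 idea can be sketched in plain Python as follows. This is only a toy model of the map, shuffle, and reduce steps, not Hive's real execution code; the sample data and function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical values of the "color" column, one per record.
colors = ["RED", "BLUE", "RED", "WHITE", "BLUE", "RED"]


def map_phase(records):
    """Mapper: turn every record into a (color, 1) pair."""
    for color in records:
        yield (color, 1)


def shuffle(pairs):
    """Shuffle: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the ones for each color."""
    return {key: sum(values) for key, values in groups.items()}


color_counts = reduce_phase(shuffle(map_phase(colors)))
print(color_counts)
```

The reduce step is needed here because records with the same color may have been read by different mappers, so the ones have to be brought together by key before they can be summed.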
Now take the same query and add an order by clause.
Select color
, count(1) as color_count
from table
group by color
order by color_count desc;
This adds another reduce phase and forces a single reducer that the whole data set has to pass through. This is necessary because we want to ensure that global ordering is maintained. Count(distinct color) also forces a single reducer and requires a map and reduce phase.
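A small Python sketch can show why global ordering pushes everything through one reducer. The (color, color_count) pairs and the way they are split between two reducers are invented for illustration; real Hive partitions data differently.

```python
# Hypothetical output of the group-by job: (color, color_count) pairs.
counts = [("RED", 3), ("BLUE", 7), ("WHITE", 1), ("GREEN", 5)]


def sort_desc(pairs):
    """Sort pairs by count, largest first, like 'order by color_count desc'."""
    return sorted(pairs, key=lambda kv: kv[1], reverse=True)


# Two reducers, each sorting only its own share of the pairs.
part_a, part_b = counts[:2], counts[2:]
two_reducers = sort_desc(part_a) + sort_desc(part_b)

# One reducer that sees every pair.
one_reducer = sort_desc(counts)

print(two_reducers)  # each half is sorted, but the whole list is not
print(one_reducer)   # globally sorted
```

Each of the two reducers produces a correctly sorted piece, yet putting the pieces side by side does not give a sorted whole. Only the reducer that sees all the pairs can guarantee the order.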
As you add complexity to your Hive query, you similarly add to the map and reduce jobs required to obtain the requested results.
If you want to find out how Hive will execute a query, you can put the explain clause in front of your query.
Explain select * from table;
This can give you an idea of how the query is being executed under the hood. It will show you the dependencies between stages, which aggregations (if any) result in reduce jobs, and which operators result in map jobs.
To understand the reason, we first need to know what the map and reduce phases mean:
Map: Basically a filter which filters and organizes data in sorted order. For example, it will pick col1_name and col2_name out of each row in the second query. However, in the first query you are reading every column, so no filtering is required. Hence no map phase.
Reduce: A reduce is just a summary operation over data across rows, for example the sum of a column. Neither of the two queries needs any summary data. Hence no reducer.
So the first query has no MapReduce, and the second query has only mappers but no reducers.
It's logical.
In the first query, the only thing to be done is to dump the data with a limit of 5 (which means any 5 rows are dumped). No query-specific processing is needed, other than knowing how rows are separated.
But in the second query a MapReduce job is needed. Why? Because it first has to process the data to know how many columns there are, and then check whether col1 and col2 really exist or whether there is only one column. If they exist, it has to eliminate the other columns first, and then take only five rows from the remaining columns.