
Hadoop configuration: mapred.* vs mapreduce.*

I noticed that there are two sets of Hadoop configuration parameters: one prefixed with mapred.* and the other with mapreduce.*. I am guessing this is due to the old API vs. the new API, but if I am not mistaken, both seem to coexist in the new API. Am I correct? If so, is there a general rule for what mapred.* is used for and what mapreduce.* is used for?

Asked by kee, Jun 11 '12 19:06



1 Answer

Examining the source for 0.20.2, there are only a few mapreduce.* properties, and they revolve around configuring the job's input/output format, mapper/combiner/reducer and partitioner classes. They also signal to the job client that the user is using the new API (look through the source of org.apache.hadoop.mapreduce.Job, specifically the setUseNewAPI() method):

  • mapreduce.inputformat.class
  • mapreduce.outputformat.class
  • mapreduce.partitioner.class
  • mapreduce.map.class
  • mapreduce.combine.class
  • mapreduce.reduce.class
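
To make the list concrete, here is a minimal sketch that checks whether a configuration sets any of these six keys. It uses plain java.util.Properties as a stand-in for Hadoop's Configuration (an assumption, so that it runs without Hadoop on the classpath). The class NewApiKeys, the helper setsNewApiKey and the mapper class name are hypothetical; this is not how the job client itself makes the decision.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Stand-in sketch: java.util.Properties plays the role of Hadoop's Configuration.
final class NewApiKeys {
    // The six mapreduce.* keys listed above (as found in the 0.20.2 source).
    static final List<String> KEYS = Arrays.asList(
        "mapreduce.inputformat.class",
        "mapreduce.outputformat.class",
        "mapreduce.partitioner.class",
        "mapreduce.map.class",
        "mapreduce.combine.class",
        "mapreduce.reduce.class");

    // Hypothetical helper: true if at least one of the six keys is set.
    static boolean setsNewApiKey(Properties conf) {
        for (String key : KEYS) {
            if (conf.getProperty(key) != null) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical mapper class name, for illustration only.
        conf.setProperty("mapreduce.map.class", "com.example.MyMapper");
        System.out.println(setsNewApiKey(conf));
    }
}
```

A configuration that only sets mapred.* keys would make the helper return false, which matches the point that these six keys are the ones tied to the new API's job classes.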

There are some more properties, but they are secondary configuration.

The input and output formats, whether old or new API versions, typically use mapred.* properties.

For example, to specify your MapReduce input paths you use mapred.input.dir (whether you're using the new or old API). The same goes for the output property, mapred.output.dir.
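
As a rough illustration of how both prefixes end up side by side in one new-API job configuration, the sketch below groups property names by prefix. Again, java.util.Properties stands in for Hadoop's Configuration (an assumption), and the class PrefixGroups, the helper byPrefix, the paths and the mapper class name are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;
import java.util.TreeSet;

// Stand-in sketch: java.util.Properties plays the role of Hadoop's Configuration.
final class PrefixGroups {
    // Groups property names by the text before the first '.', in sorted order.
    static Map<String, List<String>> byPrefix(Properties conf) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : new TreeSet<>(conf.stringPropertyNames())) {
            int dot = name.indexOf('.');
            String prefix = dot < 0 ? name : name.substring(0, dot);
            groups.computeIfAbsent(prefix, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        // Hypothetical values; only the key names come from the answer.
        conf.setProperty("mapreduce.map.class", "com.example.MyMapper");
        conf.setProperty("mapred.input.dir", "/data/in");
        conf.setProperty("mapred.output.dir", "/data/out");
        System.out.println(byPrefix(conf));
    }
}
```

Dumping a real job's configuration and grouping it this way is one quick way to see which keys a given Hadoop version actually uses before digging into the source.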

So the long and the short of it is: if there isn't a utility method to configure the property (such as FileInputFormat.setInputPaths(Job, String)), then you'll need to check the source.

Answered by Chris White, Nov 13 '22 00:11