
Key of Object type in the Hadoop mapper

New to Hadoop and trying to understand the MapReduce WordCount example code from here.

The Mapper from the documentation is:

Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

I see that in the MapReduce word count example the map code is as follows:

public void map(Object key, Text value, Context context)

Question - What is the point of this key of type Object? If the input to a mapper is a text document, I am assuming the value in would be the chunk of text (64 MB or 128 MB) that Hadoop has partitioned and stored in HDFS. More generally, what is the use of this input key KEYIN to the map code?

Any pointers would be greatly appreciated

asked Mar 15 '15 by user275157

People also ask

What is key in MapReduce?

A key-value pair in MapReduce is the record entity that Hadoop MapReduce accepts for execution. Hadoop is used mainly for data analysis and deals with structured, unstructured, and semi-structured data. With Hadoop, if the schema is static we can work directly on the columns instead of on key-value pairs.

What is the type of mapper class?

Mapper is the base class used to implement map tasks in Hadoop MapReduce. Maps are the individual tasks which run before the reducers and transform the inputs into a set of output values. These output values are the intermediate values which act as the input to the reduce task.

What is Mapper code in Hadoop?

The mapper is the first piece of code responsible for turning the data stored in HDFS blocks into key-value pairs. Hadoop assigns one map task per block, i.e. if the data spans 20 blocks then 20 map tasks run in parallel, and the mapper output is stored on the local disk.

What does the mapper map input key, value pairs?

Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.


1 Answer

InputFormat describes the input specification for a MapReduce job. By default, Hadoop uses TextInputFormat, which extends FileInputFormat, to process the input files.

We can also specify the input format to use in the client or driver code:

job.setInputFormatClass(SomeInputFormat.class);
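
For context, here is a minimal driver sketch for the word count job. It assumes the TokenizerMapper and IntSumReducer classes from the standard WordCount example are available on the classpath; the driver class name, the job name, and the input/output paths taken from args are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // TextInputFormat is already the default; set explicitly here for clarity.
    job.setInputFormatClass(TextInputFormat.class);

    // Mapper, combiner, and reducer classes from the standard WordCount example.
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));   // input path (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path (placeholder)

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}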

For the TextInputFormat, files are broken into lines. Keys are the byte offset of each line within the file, and values are the line of text.

In public void map(Object key, Text value, Context context), the key is the line's byte offset and the value is the actual text of the line.
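
For example, for a hypothetical two-line input file containing

hello world
hello hadoop

TextInputFormat would hand the mapper the records (0, "hello world") and (12, "hello hadoop"), since the second line starts at byte offset 12.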

Please look at the TextInputFormat API: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html

By default, the key is of type LongWritable and the value is of type Text for the TextInputFormat. In your example, Object is specified in place of LongWritable because it is compatible (LongWritable is a subtype of Object). You can also use LongWritable in place of Object.
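
As a sketch, the WordCount mapper with the key declared explicitly as LongWritable would look like this (same logic as the standard TokenizerMapper, only the key type is narrowed from Object):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Key type declared as LongWritable: with TextInputFormat the key is the
// byte offset of the line within the file, and it is usually ignored.
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private final Text word = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // key.get() would return the line's starting byte offset; word count doesn't need it.
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one); // emit (word, 1) as the intermediate key/value pair
    }
  }
}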

answered Oct 09 '22 by Ramana