New to Hadoop and trying to understand the MapReduce WordCount example code from here.
The Mapper class signature from the documentation is -
Mapper<KEYIN,VALUEIN,KEYOUT,VALUEOUT>
I see that in the mapreduce word count example the map code is as follows
public void map(Object key, Text value, Context context)
Question - What is the point of this key of type Object? If the input to a mapper is a text document, I am assuming the value in would be the chunk of text (64 MB or 128 MB) that Hadoop has partitioned and stored in HDFS. More generally, what is the use of this input key KEYIN in the map code?
Any pointers would be greatly appreciated.
A key-value pair is the record entity that Hadoop MapReduce accepts for execution. Hadoop is used mainly for data analysis, and it deals with structured, unstructured and semi-structured data. With Hadoop, if the schema is static we can work directly on the columns instead of key-value pairs.
Mapper is the base class used to implement Map tasks in Hadoop MapReduce. Maps are the individual tasks that run before the reducers and transform the inputs into a set of output values. These output values are intermediate values that act as the input to the Reduce task.
Mapper is the first code responsible for migrating/manipulating the data stored in HDFS blocks into key-value pairs. Hadoop assigns one map task to each block, i.e. if my data is on 20 blocks then 20 map tasks will run in parallel, and the mapper output gets stored on local disk.
Maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks which transform input records into intermediate records. The transformed intermediate records need not be of the same type as the input records. A given input pair may map to zero or many output pairs.
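As a concrete illustration, here is a plain-Java simulation of the WordCount map step (no Hadoop dependency, names are my own for illustration): each input record - a (byte offset, line of text) pair - is transformed into zero or more intermediate (word, 1) pairs, mirroring what `context.write(word, one)` does in the real mapper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Plain-Java simulation of the WordCount map step, for illustration only.
// One input record is a (byte offset, line) pair; the output is a list of
// intermediate (word, 1) pairs, mirroring context.write(word, one).
public class MapSketch {
    static List<String[]> map(long keyIn, String valueIn) {
        List<String[]> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(valueIn);
        while (itr.hasMoreTokens()) {
            // the input key (byte offset) is simply ignored by WordCount
            out.add(new String[] { itr.nextToken(), "1" });
        }
        return out;
    }

    public static void main(String[] args) {
        // one input record: key = offset 0, value = the line text
        for (String[] kv : map(0L, "hello hadoop hello")) {
            System.out.println(kv[0] + "\t" + kv[1]);
        }
    }
}
```

Note that a single input pair here produces three output pairs, and the input key plays no role - which is exactly why WordCount can declare it loosely as Object.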
InputFormat describes the input specification for a MapReduce job. By default, Hadoop uses TextInputFormat, which inherits from FileInputFormat, to process the input files.
We can also specify the input format to use in the client or driver code:
job.setInputFormatClass(SomeInputFormat.class);
For the TextInputFormat, files are broken into lines. Keys are the position (byte offset) in the file, and values are the line of text.
In public void map(Object key, Text value, Context context), key is the line offset and value is the actual text of the line.
Please look at the TextInputFormat API: https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/TextInputFormat.html
By default, for the TextInputFormat the key is of type LongWritable and the value is of type Text. In your example, Object is specified in the place of LongWritable, as it is compatible (LongWritable, like every Java class, is a subtype of Object). You can also use LongWritable in the place of Object.
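To make the byte-offset keys concrete, here is a small dependency-free sketch (class and method names are my own) that assigns keys to lines the way TextInputFormat does: each line's key is the byte offset of the line's start within the file, and its value is the line's text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simulates how TextInputFormat produces input records: for each line,
// the key is the byte offset where the line starts in the file, and the
// value is the line's text. Plain-Java sketch, no Hadoop dependency.
public class OffsetDemo {
    static Map<Long, String> toRecords(String fileContents) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            records.put(offset, line);
            offset += line.getBytes().length + 1; // +1 for the '\n' separator
        }
        return records;
    }

    public static void main(String[] args) {
        // "hello world" is 11 bytes, so the second line starts at offset 12
        Map<Long, String> r = toRecords("hello world\nfoo bar");
        System.out.println(r);
    }
}
```

So for a two-line file, the mapper is called once per line with keys 0 and 12 - values the framework computes for you, which WordCount is free to ignore.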