Mapper input Key-Value pair in Hadoop

Tags:

key-value

hadoop

mapreduce

Normally, we write the mapper in the form:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable>

Here the input key-value pair for the mapper is <LongWritable, Text>. As far as I know, when the mapper gets the input data it goes through it line by line, so the key for the mapper signifies the line number. Please correct me if I am wrong.

My question is: if I give the input key-value pair for the mapper as <Text, Text>, it gives the error

 java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text

Is it mandatory to give the input key-value pair of the mapper as <LongWritable, Text>? If yes, why? If no, what is the reason for the error? Can you please help me understand the reasoning behind the error?

Thanks in advance.

232

asked Oct 27 '13 22:10

Ronin

2 Answers

The input to the mapper depends on which InputFormat is used. The InputFormat is responsible for reading the incoming data and shaping it into whatever format the Mapper expects. The default InputFormat is TextInputFormat, which extends FileInputFormat<LongWritable, Text>. Its key is the byte offset at which each line starts within the file (not the line number), and its value is the contents of the line.

If you do not change the InputFormat, using a Mapper with a different key-value type signature than <LongWritable, Text> will cause this error. If you expect <Text, Text> input, you will have to choose an appropriate InputFormat. You can set the InputFormat in the Job setup:

job.setInputFormatClass(MyInputFormat.class);

And like I said, by default this is set to TextInputFormat.
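To make the default key concrete: with TextInputFormat the LongWritable key is the byte offset at which a line starts, not a line number. The snippet below is only a plain-Java illustration of that pairing and uses no Hadoop classes. The names OffsetKeys and lineOffsets are made up for this sketch, and it assumes UTF-8 input with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustration only: mimics the (byte offset, line) pairs that a mapper
// receives from the default TextInputFormat. Not Hadoop code.
final class OffsetKeys {
    static List<Map.Entry<Long, String>> lineOffsets(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<Long, String>(offset, line));
            // Advance by the line's byte length plus one byte for the newline.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }
}
```

For the input "A,value1\nB,value2\n" this yields the keys 0 and 9, which is why the key type is a long and why a Mapper declared with a Text key fails with the ClassCastException.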

Now, let's say your input data is a bunch of newline-separated records delimited by a comma:

"A,value1"
"B,value2"

If you want the input key-value pairs to the mapper to be ("A", "value1"), ("B", "value2"), you will have to implement a custom InputFormat and RecordReader with the <Text, Text> signature. Fortunately, this is pretty easy. There is an example here and probably a few examples floating around StackOverflow as well.

In short, add a class which extends FileInputFormat<Text, Text> and a class which extends RecordReader<Text, Text>. Override the FileInputFormat#createRecordReader method (it is called getRecordReader in the old org.apache.hadoop.mapred API), and have it return an instance of your custom RecordReader.

Then you will have to implement the required RecordReader logic. The simplest way to do this is to create an instance of LineRecordReader in your custom RecordReader, and delegate all basic responsibilities to this instance. In the getCurrentKey and getCurrentValue methods you will implement the logic for extracting the comma-delimited Text contents by calling LineRecordReader#getCurrentValue and splitting the result on the comma.

Finally, set your new InputFormat as the Job's InputFormat with job.setInputFormatClass, as shown above.
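The delegation idea can be sketched without any Hadoop dependency. In the stand-in below, an Iterator<String> plays the role of the wrapped LineRecordReader and String replaces Text. The class name CommaRecordReaderSketch is hypothetical, and splitting on the first comma only (with an empty value when no comma is present) is an assumption of this sketch, not something the answer specifies.

```java
import java.util.Iterator;

// Plain-Java stand-in for a custom RecordReader<Text, Text>.
// The iterator stands in for the delegated LineRecordReader.
final class CommaRecordReaderSketch {
    private final Iterator<String> lines;
    private String key;
    private String value;

    CommaRecordReaderSketch(Iterator<String> lines) {
        this.lines = lines;
    }

    // Advances to the next line and splits it; returns false at end of input.
    boolean nextKeyValue() {
        if (!lines.hasNext()) {
            key = null;
            value = null;
            return false;
        }
        String line = lines.next();
        int comma = line.indexOf(',');
        if (comma < 0) {
            // Assumption: a line without a comma becomes a key with an empty value.
            key = line;
            value = "";
        } else {
            key = line.substring(0, comma);
            value = line.substring(comma + 1);
        }
        return true;
    }

    String getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }
}
```

In a real RecordReader the same split would be applied to the Text returned by the delegate's getCurrentValue, and initialize, getProgress and close would simply forward to the delegate.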

112

answered Sep 29 '22 16:09

Alex A.

In the book "Hadoop: The Definitive Guide", Tom White gives what I think is an appropriate answer to this (pg. 197):

"TextInputFormat’s keys, being simply the offset within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character. For example, this is the output produced by TextOutputFormat, Hadoop’s default OutputFormat. To interpret such files correctly, KeyValueTextInputFormat is appropriate.

You can specify the separator via the key.value.separator.in.input.line property. It is a tab character by default."

answered Sep 29 '22 17:09

canada11
