I want to know whether the OutputCollector instance used in the map function, the output in output.collect(key, value), stores the key-value pairs somewhere. Even if it emits them to the reducer function, there must be an intermediate file, right? What are those files? Are they visible to, and decided by, the programmer? Are the OutputKeyClass and OutputValueClass that we specify in the main function [Text.class and IntWritable.class] these places of storage?
I'm giving the standard code for the Word Count example in MapReduce, which can be found in many places on the net.
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {

    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                // Emit (word, 1) for every token; the framework, not this code, decides where the pair is stored
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // Emit (word, total count) to the job output
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        // These set the *types* of the job's output pairs, not their storage location
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on Hadoop commodity servers. At the end, it aggregates the data from the multiple servers and returns a consolidated output to the application.
OutputCollector is the generalization of the facility provided by the MapReduce framework to collect data output by either the Mapper or the Reducer, i.e., intermediate outputs or the output of the job.
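For reference, the interface itself is tiny: collect(key, value) is its only method. A sketch of the org.apache.hadoop.mapred.OutputCollector declaration:

public interface OutputCollector<K, V> {
    // Adds a key/value pair to the output; the framework decides where the pair actually goes
    void collect(K key, V value) throws IOException;
}

So the collector is just a handle the framework hands to your map/reduce methods; it carries no user-visible storage of its own.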
A Combiner, also known as a semi-reducer, is an optional class that accepts the inputs from the Map class and thereafter passes the output key-value pairs to the Reducer class. The main function of a Combiner is to summarize the map output records that share the same key, as the sketch below illustrates.
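As a rough illustration (plain Java, no Hadoop, and a made-up input line), this is the effect a combiner has on one map task's output: duplicate keys are pre-summed locally before anything is sent across the network to the reducers.

import java.util.HashMap;
import java.util.StringTokenizer;

public class CombinerSketch {
    public static void main(String[] args) {
        // One mapper's input line (hypothetical example)
        String line = "the cat sat on the mat";
        // The mapper would emit (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1);
        // the combiner sums duplicates locally, like the reduce step but per map task
        HashMap<String, Integer> combined = new HashMap<String, Integer>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            String word = tokenizer.nextToken();
            Integer count = combined.get(word);
            combined.put(word, count == null ? 1 : count + 1);
        }
        // Prints something like {the=2, cat=1, sat=1, on=1, mat=1}
        System.out.println(combined);
    }
}

This also explains why the example above can reuse Reduce as the combiner: summing counts is associative, so partial sums per map task followed by a final sum at the reducer give the same result.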
A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage. In the map stage, the mapper's job is to process the input data.
The output from the map function is stored in temporary intermediate files. These files are handled transparently by Hadoop, so in a normal scenario the programmer doesn't have access to them. If you're curious about what's happening inside each mapper, you can review the logs for the respective job, where you'll find a log file for each map task.
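If you just want to see where the framework is allowed to put those intermediate (spill) files, you can inspect the relevant configuration property. A minimal sketch, assuming the old Hadoop 1.x API used above, where the property is "mapred.local.dir"; note the files themselves are still created and cleaned up by the framework:

import org.apache.hadoop.mapred.JobConf;

public class LocalDirSketch {
    public static void main(String[] args) {
        JobConf conf = new JobConf(WordCount.class);
        // Local-disk directories where map outputs are spilled
        // (renamed "mapreduce.cluster.local.dir" in newer Hadoop versions)
        System.out.println(conf.get("mapred.local.dir"));
    }
}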
If you want to control where the temporary files are generated and have access to them, you would have to create your own OutputCollector class, and I don't know how easy that is.
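That said, since OutputCollector is just an interface with a single collect method, a wrapper is straightforward. Here is a minimal, hypothetical sketch (LoggingCollector is not part of Hadoop) that copies every emitted pair to a side file of your choosing while delegating to the real collector; you would construct it inside map() around the output parameter, and the framework's own intermediate files are unaffected:

import java.io.IOException;
import java.io.PrintWriter;
import org.apache.hadoop.mapred.OutputCollector;

// Hypothetical wrapper, purely illustrative
class LoggingCollector<K, V> implements OutputCollector<K, V> {
    private final OutputCollector<K, V> delegate; // the collector Hadoop passed in
    private final PrintWriter sideFile;           // our own visible copy of the pairs

    LoggingCollector(OutputCollector<K, V> delegate, PrintWriter sideFile) {
        this.delegate = delegate;
        this.sideFile = sideFile;
    }

    public void collect(K key, V value) throws IOException {
        sideFile.println(key + "\t" + value); // record the pair where we can see it
        delegate.collect(key, value);         // normal framework path
    }
}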
If you want to have a look at the source code, you can use svn to get it. I think it is available here: http://hadoop.apache.org/common/version_control.html.