
Programmatically reading the output of a Hadoop MapReduce program

This may be a basic question, but I could not find an answer for it on Google.
I have a map-reduce job that creates multiple output files in its output directory. My Java application executes this job on a remote Hadoop cluster and, after the job is finished, needs to read the output programmatically using the org.apache.hadoop.fs.FileSystem API. Is that possible?
The application knows the output directory, but not the names of the output files generated by the map-reduce job, and there seems to be no way to programmatically list the contents of a directory through the Hadoop file system API. How can the output files be read?
This seems like such a commonplace scenario that I am sure it has a solution; I must be missing something obvious.

nabeelmukhtar asked Apr 12 '11


People also ask

What is MapReduce in Hadoop with example?

The MapReduce programming paradigm lets you scale the processing of unstructured data across hundreds or thousands of commodity servers in an Apache Hadoop cluster. It has two main phases: the map phase and the reduce phase. The input data is fed to the map phase, which transforms it into intermediate key-value pairs that are then grouped and processed by the reduce phase.
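For illustration, here is a minimal word-count sketch of the two phases using the org.apache.hadoop.mapreduce API; the class names are illustrative, not from the question:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit (word, 1) for every word in the input split
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }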

How many output files are produced by a MapReduce program?

In a Hadoop MapReduce job, each reducer produces one output file named part-r-nnnnn, where nnnnn is a zero-based sequence number; the number of such files equals the number of reducers set for the job.
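For example, assuming a driver with a Job object named job (a fragment, not a complete program), the output file count follows directly from the reducer count:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    // with four reducers, the job writes part-r-00000 through part-r-00003
    job.setNumReduceTasks(4);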

What is the output of the reducer?

The output of the reducer is the job's final output, which is stored in HDFS. Usually, a Hadoop reducer performs aggregation, such as summation, over the values grouped under each key.


1 Answer

The method you are looking for is FileSystem#listStatus(Path). It returns everything inside a Path as a FileStatus array. You can then loop over the statuses, take each entry's Path, and read the file it points to.

    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;

    // fs is the cluster's FileSystem and conf its Configuration
    FileStatus[] fss = fs.listStatus(new Path("/"));
    for (FileStatus status : fss) {
        Path path = status.getPath();
        // skip entries that are not reducer outputs, e.g. the _SUCCESS marker
        if (!path.getName().startsWith("part-")) {
            continue;
        }
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        IntWritable value = new IntWritable();
        while (reader.next(key, value)) {
            System.out.println(key.get() + " | " + value.get());
        }
        reader.close();
    }

For Hadoop 2.x, where the three-argument constructor above is deprecated, you can set up the reader like this:

    SequenceFile.Reader reader =
            new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
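If the job writes plain text through the default TextOutputFormat rather than sequence files, the same listStatus loop applies, but each part file is read as a text stream. A minimal, self-contained sketch under that assumption (the class name and argument handling are illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Prints every line of every part-* file in a text-format output directory
    public class TextOutputReader {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            FileStatus[] fss = fs.listStatus(new Path(args[0]));
            for (FileStatus status : fss) {
                Path path = status.getPath();
                if (!path.getName().startsWith("part-")) {
                    continue; // skip _SUCCESS and other non-data entries
                }
                try (BufferedReader br = new BufferedReader(
                        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        System.out.println(line);
                    }
                }
            }
        }
    }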
Thomas Jungblut answered Oct 26 '22