
How to use Hadoop InputFormats in Apache Spark?

I have a class ImageInputFormat in Hadoop that reads images from HDFS. How can I use this InputFormat in Spark?

Here is my ImageInputFormat:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ImageInputFormat extends FileInputFormat<Text, ImageWritable> {

    @Override
    public ImageRecordReader createRecordReader(InputSplit split,
                  TaskAttemptContext context) throws IOException, InterruptedException {
        return new ImageRecordReader();
    }

    // Images are binary blobs, so a file must never be split across records.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
asked Jan 09 '14 by Hellen


People also ask

What is the default input format in Hadoop?

The default input format is TextInputFormat, with the byte offset as the key and the entire line as the value.
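
A minimal sketch of reading a file with that default format from Spark's Java API, assuming an existing JavaSparkContext named sc and a hypothetical input path:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<LongWritable, Text> lines = sc.hadoopFile(
        "hdfs:///path/to/input",   // hypothetical path
        TextInputFormat.class,     // the old-API (mapred) TextInputFormat
        LongWritable.class,        // key: byte offset of the line in the file
        Text.class);               // value: the line itself

sc.textFile(path) is essentially this call with the keys dropped.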

What file format does Spark use?

Spark's default file format is Parquet. Parquet has a number of advantages that improve the performance of querying and filtering the data.
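
For illustration, a minimal sketch using the DataFrame API (added in Spark versions later than this question), assuming hypothetical file names:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("ParquetExample").getOrCreate();

Dataset<Row> df = spark.read().json("people.json");   // hypothetical input file
df.write().parquet("people.parquet");                 // columnar Parquet output

// Filters on Parquet files can skip data using the column statistics
// stored in the file, one source of the performance advantage.
spark.read().parquet("people.parquet").filter("age > 30").show();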

What is Hadoop default input and output format?

It is the base class for all file-based InputFormats. Hadoop's FileInputFormat specifies the input directory where the data files are located. When we start a Hadoop job, FileInputFormat is given a path containing the files to read; it lists all of those files and divides them into one or more InputSplits, as in the sketch below.
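
In a plain MapReduce job that path is supplied through FileInputFormat itself; a minimal sketch, reusing the ImageInputFormat from the question and a hypothetical input directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

Job job = Job.getInstance(new Configuration(), "image job");
job.setInputFormatClass(ImageInputFormat.class);

// FileInputFormat lists every file under this directory and
// divides each one into one or more InputSplits.
FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/images"));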


1 Answer

The SparkContext has a method called hadoopFile, which accepts classes implementing the interface org.apache.hadoop.mapred.InputFormat (the old mapred API). Its description says "Get an RDD for a Hadoop file with an arbitrary InputFormat".

Your ImageInputFormat extends the FileInputFormat from the new org.apache.hadoop.mapreduce API, so the matching method is newAPIHadoopFile.

Also have a look at the Spark Documentation.
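
A minimal sketch of how that call looks from Java, assuming the ImageInputFormat above and a hypothetical HDFS directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadImages"));

JavaPairRDD<Text, ImageWritable> images = sc.newAPIHadoopFile(
        "hdfs:///path/to/images",   // hypothetical input directory
        ImageInputFormat.class,     // the custom InputFormat from the question
        Text.class,                 // key class
        ImageWritable.class,        // value class
        new Configuration());

System.out.println("Number of images: " + images.count());

Each record in the resulting pair RDD is a (Text, ImageWritable) pair produced by the ImageRecordReader.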

answered Sep 17 '22 by Robert Metzger