I have a class ImageInputFormat in Hadoop which reads images from HDFS. How can I use my InputFormat in Spark? Here is my ImageInputFormat:
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class ImageInputFormat extends FileInputFormat<Text, ImageWritable> {

    @Override
    public ImageRecordReader createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new ImageRecordReader();
    }

    // Images are binary; never split a file across record boundaries.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}
The default input format is TextInputFormat, which uses the byte offset as the key and the entire line as the value.
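As a minimal sketch of that default in Spark (the HDFS paths are placeholders), textFile wraps TextInputFormat and keeps only the value, while hadoopFile exposes the (byte offset, line) pairs directly:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TextDefaultExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "TextDefaultExample");

        // textFile uses TextInputFormat under the hood; only the line is kept.
        JavaRDD<String> lines = sc.textFile("hdfs:///path/to/input"); // placeholder path

        // The same read spelled out with the old-API TextInputFormat yields
        // (byte offset, line) pairs.
        JavaPairRDD<LongWritable, Text> pairs = sc.hadoopFile(
                "hdfs:///path/to/input", TextInputFormat.class,
                LongWritable.class, Text.class);

        System.out.println(lines.count() + " lines, " + pairs.count() + " pairs");
        sc.stop();
    }
}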
For DataFrames, Spark's default data source is Parquet, a columnar format whose column pruning and predicate pushdown improve the performance of querying and filtering data.
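A quick sketch of that default (paths are placeholders): a read or write with no explicit format falls back to Parquet.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetDefaultExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetDefaultExample").getOrCreate();

        // With no format specified, spark.sql.sources.default (parquet) is used.
        Dataset<Row> df = spark.read().load("hdfs:///path/to/data"); // placeholder path
        df.write().save("hdfs:///path/to/out");                      // written as Parquet

        spark.stop();
    }
}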
FileInputFormat is the base class for all file-based InputFormats in Hadoop. It specifies the input directory where the data files are located: when a Hadoop job starts, FileInputFormat is given a path containing the files to read, and it divides those files into one or more InputSplits.
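In a plain MapReduce job that wiring would look roughly like this (the job name and path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class JobSetupExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "image-job"); // placeholder name

        // FileInputFormat is handed the input directory; at submission time it
        // lists the files there and carves them into InputSplits.
        FileInputFormat.addInputPath(job, new Path("hdfs:///path/to/images"));
        job.setInputFormatClass(ImageInputFormat.class);
    }
}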
The SparkContext has a method called hadoopFile. Its description says "Get an RDD for a Hadoop file with an arbitrary InputFormat", and it accepts classes implementing the old interface org.apache.hadoop.mapred.InputFormat. Your ImageInputFormat, however, extends the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat (its createRecordReader takes a TaskAttemptContext), so the companion method newAPIHadoopFile is the one to use. Also have a look at the Spark documentation.
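A minimal sketch of calling it from Java, assuming your ImageWritable and ImageRecordReader classes are on the classpath (the app name and HDFS path are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ImageReadExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("ImageReadExample"); // placeholder name
        JavaSparkContext sc = new JavaSparkContext(conf);

        // newAPIHadoopFile takes the path, the new-API InputFormat class, and
        // the key/value classes; the Configuration is passed through to the reader.
        JavaPairRDD<Text, ImageWritable> images = sc.newAPIHadoopFile(
                "hdfs:///path/to/images", // placeholder input directory
                ImageInputFormat.class,
                Text.class,
                ImageWritable.class,
                new Configuration());

        System.out.println("Number of images: " + images.count());
        sc.stop();
    }
}

One caveat worth knowing: Hadoop RecordReaders typically reuse the same Writable object for every record, so map the values to your own objects (or copy them) before caching or collecting the RDD.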