
How many types of InputFormat are there in Hadoop?

Tags:

hadoop

I'm new to Hadoop and wondering how many types of InputFormat there are in Hadoop, such as TextInputFormat. Is there an InputFormat that I can use to read files via HTTP requests to remote data servers?

Thanks :)

Trams asked Dec 08 '15 03:12


2 Answers

There are many classes implementing InputFormat:

CombineFileInputFormat, CombineSequenceFileInputFormat, 
CombineTextInputFormat, CompositeInputFormat, DBInputFormat,
FileInputFormat, FixedLengthInputFormat, KeyValueTextInputFormat, 
MultiFileInputFormat, NLineInputFormat, Parser.Node, 
SequenceFileAsBinaryInputFormat, SequenceFileAsTextInputFormat, 
SequenceFileInputFilter, SequenceFileInputFormat, TextInputFormat

Have a look at this article on when to use which type of InputFormat.

Out of these, the most frequently used formats are:

  • FileInputFormat : Base class for all file-based InputFormats.
  • KeyValueTextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either line feed or carriage return is used to signal end of line. Each line is divided into key and value parts by a separator byte. If no such byte exists, the key will be the entire line and the value will be empty.
  • TextInputFormat : An InputFormat for plain text files. Files are broken into lines. Either line feed or carriage return is used to signal end of line. Keys are the positions in the file, and values are the lines of text.
  • NLineInputFormat : Splits the input so that every N lines form one split. In many "pleasantly" parallel applications, each process/mapper processes the same input file(s), but the computations are controlled by different parameters.
  • SequenceFileInputFormat : An InputFormat for SequenceFiles.
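To make the difference between the TextInputFormat and KeyValueTextInputFormat record shapes concrete, here is a minimal plain-Java sketch. It does not use Hadoop at all; the class and method names (RecordSemanticsSketch, textRecords, keyValueRecords) are made up for illustration, and it assumes '\n' line endings and a single-character separator. It only mimics the keys and values described above and is not how Hadoop implements them.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration of the record shapes; not Hadoop's implementation.
final class RecordSemanticsSketch {

    // TextInputFormat-style records: key = byte offset where the line starts,
    // value = the line itself. Assumes '\n' as the only line terminator.
    static List<Map.Entry<Long, String>> textRecords(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        int start = 0;
        while (start < content.length()) {
            int end = content.indexOf('\n', start);
            if (end < 0) {
                end = content.length();
            }
            String line = content.substring(start, end);
            records.add(Map.entry(offset, line));
            // Advance by the encoded length of the line plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return records;
    }

    // KeyValueTextInputFormat-style records: split each line at the first
    // separator; with no separator, the whole line is the key and the value is empty.
    static List<Map.Entry<String, String>> keyValueRecords(String content, char separator) {
        List<Map.Entry<String, String>> records = new ArrayList<>();
        for (Map.Entry<Long, String> record : textRecords(content)) {
            String line = record.getValue();
            int pos = line.indexOf(separator);
            if (pos < 0) {
                records.add(Map.entry(line, ""));
            } else {
                records.add(Map.entry(line.substring(0, pos), line.substring(pos + 1)));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String content = "user1\tlogin\nuser2\tlogout\nno-separator-here\n";
        System.out.println(textRecords(content));
        System.out.println(keyValueRecords(content, '\t'));
    }
}
```

The tab character is used as the separator here because it is the default separator for KeyValueTextInputFormat.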

Regarding your second question: fetch the files from the remote servers first, then use the appropriate InputFormat depending on the file contents. Hadoop works best with data locality.
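The "fetch first" step could look roughly like the sketch below, which copies the content behind a URI into a local file using only the Java standard library. The class name RemoteFetchSketch and the method fetch are hypothetical, and error handling, retries and authentication are left out.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: download a remote file before handing it to Hadoop.
final class RemoteFetchSketch {

    // Copies the bytes behind 'source' (e.g. an http:// or file:// URI) into
    // 'target', overwriting any existing file. Returns the number of bytes copied.
    static long fetch(URI source, Path target) throws IOException {
        try (InputStream in = source.toURL().openStream()) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RemoteFetchSketch <source-uri> <target-path>");
            return;
        }
        long bytes = fetch(URI.create(args[0]), Paths.get(args[1]));
        System.out.println("copied " + bytes + " bytes");
    }
}
```

Once the file is local, it can be copied into HDFS (for example with hdfs dfs -put) and read with whichever InputFormat matches its contents.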

Ravindra babu answered Sep 21 '22 22:09


Your first question - how many types of InputFormat are there in Hadoop such as TextInputFormat?

  1. TextInputFormat - each line is treated as a value (the key is the line's position in the file)
  2. KeyValueTextInputFormat - the text before the first delimiter is the key and the rest of the line is the value
  3. FixedLengthInputFormat - each fixed-length record is treated as a value
  4. NLineInputFormat - every N lines of input form one split
  5. SequenceFileInputFormat - for binary SequenceFiles
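As a rough illustration of items 3 and 4, the plain-Java sketch below groups lines N at a time and cuts a byte array into fixed-length records. It does not use Hadoop; the names SplitSketch, nLineGroups and fixedLengthRecords are invented for this example, and rejecting input whose length is not a multiple of the record length is simply a choice made in this sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of N-line grouping and fixed-length records.
final class SplitSketch {

    // NLineInputFormat-style grouping: every n consecutive lines form one group
    // (the last group may be shorter).
    static List<List<String>> nLineGroups(List<String> lines, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += n) {
            int end = Math.min(i + n, lines.size());
            groups.add(new ArrayList<>(lines.subList(i, end)));
        }
        return groups;
    }

    // FixedLengthInputFormat-style records: cut the data into chunks of
    // exactly recordLength bytes.
    static List<byte[]> fixedLengthRecords(byte[] data, int recordLength) {
        if (recordLength <= 0 || data.length % recordLength != 0) {
            throw new IllegalArgumentException("data length must be a multiple of recordLength");
        }
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < data.length; i += recordLength) {
            records.add(Arrays.copyOfRange(data, i, i + recordLength));
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(nLineGroups(Arrays.asList("a", "b", "c", "d", "e"), 2));
        for (byte[] record : fixedLengthRecords("abcdef".getBytes(StandardCharsets.US_ASCII), 3)) {
            System.out.println(new String(record, StandardCharsets.US_ASCII));
        }
    }
}
```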

There is also DBInputFormat, which reads from databases.

Your second question - there is no built-in input format to read files via HTTP requests.

Durga Viswanath Gadiraju answered Sep 20 '22 22:09
