
What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?

I'm new to big data and currently learning Hive. I understand the concept of InputFormat and OutputFormat in Hive as part of the SerDe. I also understand that 'Stored as' is used to store a file in a particular format, just like InputFormat. But I don't understand what the significant difference is between using 'InputFormat, OutputFormat' and 'Stored as'.

Any help is appreciated.

asked Feb 23 '17 12:02 by Metadata



1 Answer

Hive offers many options for how to store data. You can either use external storage, where Hive just wraps data that lives somewhere else, or create a standalone table from scratch in the Hive warehouse. Input and output formats let you specify the original data structure for both kinds of table, that is, how the data is physically stored. On the client side you keep working with the table using SQL, but at the low level it could be a text file, a sequence file, an HBase table, or some other data structure.

InputFormat and OutputFormat - let you describe the original data structure so that Hive can properly map it to the table view.

SerDe - the class that performs the actual translation of data between the table view and the low-level input/output format structures, in both directions.

Generally, the process looks like this: HDFS files --> InputFileFormat --> Deserializer --> Row object --> Serializer --> OutputFileFormat --> HDFS files
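To make the roles in that chain concrete, here is a minimal conceptual sketch in Python. It is not Hive's API: every function name, the `id`/`name` columns, and the sample data are made up for illustration. It only shows who does what: the input format splits storage into records, the deserializer turns a record into a row, the serializer turns a row back into a record, and the output format writes records to storage.

```python
import io

# Toy stand-ins for the Hive roles; none of these are real Hive classes.

def text_input_format(stream):
    """InputFormat role: split raw storage into records (here, one per line)."""
    for line in stream:
        yield line.rstrip("\n")

def delimited_deserializer(record, columns, sep=","):
    """Deserializer role: turn one raw record into a row keyed by column name."""
    return dict(zip(columns, record.split(sep)))

def delimited_serializer(row, columns, sep="\t"):
    """Serializer role: turn a row back into one raw record."""
    return sep.join(row[c] for c in columns)

def text_output_format(records, stream):
    """OutputFormat role: write records to storage, one per line."""
    for record in records:
        stream.write(record + "\n")

columns = ["id", "name"]                      # hypothetical table schema
source = io.StringIO("1,alice\n2,bob\n")      # stands in for the input HDFS file
target = io.StringIO()                        # stands in for the output HDFS file

# files --> input format --> deserializer --> row objects
rows = [delimited_deserializer(r, columns) for r in text_input_format(source)]

# row objects --> serializer --> output format --> files
text_output_format((delimited_serializer(row, columns) for row in rows), target)

print(rows)
print(repr(target.getvalue()))
```

Note that the read side and the write side are independent: this sketch reads comma-separated text and writes tab-separated text, which is why a table definition can name an input format and an output format separately.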

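Stored as - specifies the storage format for your new tables in Hive, and that format includes the input and output formats. With a format name (for example STORED AS ORC) it is a shorthand that selects the matching InputFormat, OutputFormat and SerDe classes for you; with STORED AS INPUTFORMAT ... OUTPUTFORMAT ... you name the classes explicitly.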
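As a rough illustration of how the shorthand and the explicit form relate, the Python sketch below expands a 'Stored as' format name into the longer clause. The class names are the ones Hive uses for its built-in TEXTFILE, ORC and PARQUET formats; the helper functions, the table name `events` and its columns are hypothetical and exist only for this example.

```python
# Built-in Hive storage formats and the classes each shorthand stands for.
STORAGE_FORMATS = {
    "TEXTFILE": {
        "serde": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "inputformat": "org.apache.hadoop.mapred.TextInputFormat",
        "outputformat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    },
    "ORC": {
        "serde": "org.apache.hadoop.hive.ql.io.orc.OrcSerde",
        "inputformat": "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
        "outputformat": "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat",
    },
    "PARQUET": {
        "serde": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
        "inputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
        "outputformat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    },
}

def expand_stored_as(shorthand):
    """Return the explicit clause that a 'STORED AS <shorthand>' stands for."""
    fmt = STORAGE_FORMATS[shorthand.upper()]
    return (
        f"ROW FORMAT SERDE '{fmt['serde']}'\n"
        f"STORED AS INPUTFORMAT '{fmt['inputformat']}'\n"
        f"OUTPUTFORMAT '{fmt['outputformat']}'"
    )

def create_table(table, columns_sql, storage_clause):
    """Build a CREATE TABLE statement (hypothetical helper, for illustration only)."""
    return f"CREATE TABLE {table} ({columns_sql})\n{storage_clause};"

# Two ways to declare the same ORC table: shorthand vs. explicit classes.
short_form = create_table("events", "id INT, name STRING", "STORED AS ORC")
long_form = create_table("events", "id INT, name STRING", expand_stored_as("ORC"))

print(short_form)
print()
print(long_form)
```

In practice the shorthand is what you would normally write; the explicit form is mainly needed for a custom or third-party format that has no shorthand of its own.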

These attributes can significantly affect performance, overall data size, and schema evolution support, and can enable features such as ACID. You can follow the steps described in this article to see how things work at the low level and to get some general information about the most commonly used formats: https://oyermolenko.blog/2017/02/16/structuring-hadoop-data-through-hive-and-sql

answered Sep 29 '22 13:09 by Alex