
Difference between 'Stored as InputFormat, OutputFormat' and 'Stored as' in Hive

There is an issue when executing show create table and then executing the resulting create table statement if the table is stored as ORC.

Using show create table, you get this:

STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

But if you create the table with those clauses, you get a casting error when selecting from it. The error looks like:

Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.BinaryComparable


To fix this, just change the create table statement to STORED AS ORC.
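For illustration, this is roughly what the working statement looks like for me (the table name and columns below are just placeholders, not my real schema):

create table my_orc_table (id int, name string)
stored as orc;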

But even after reading the answer to the similar question, What is the difference between 'InputFormat, OutputFormat' & 'Stored as' in Hive?, I can't figure out the reason.

asked Jun 08 '17 by Jason


People also ask

What is Inputformat and Outputformat in Hive?

InputFormat and OutputFormat let you describe the original data structure so that Hive can properly map it to the table view. The SerDe is the class that performs the actual translation of data between the table view and the low-level input/output format structures, in both directions.
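For example, a plain text table spelled out with all three pieces might look like this (a sketch only; the table and columns are made up, and the classes shown are the usual text-file defaults):

create table people_text (id int, name string)
row format serde
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
stored as inputformat
  'org.apache.hadoop.mapred.TextInputFormat'
outputformat
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';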

What is input format in Hive?

Qubole Hive uses CombineHiveInputFormat by default and treats the underlying files as text files by default. During table definition, users can indicate that the files are of an alternative format (such as a Sequence or RC file).

What is Tblproperties?

The TBLPROPERTIES clause allows you to tag the table definition with your own metadata key/value pairs. Some predefined table properties also exist, such as last_modified_user and last_modified_time, which are automatically added and managed by Hive.
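A small sketch of the clause in use (the 'created_by' key is an arbitrary example; 'orc.compress' is a standard ORC table property):

create table events (id int)
stored as orc
tblproperties ('orc.compress'='SNAPPY', 'created_by'='data-team');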


2 Answers

STORED AS implies 3 things:

  1. SERDE
  2. INPUTFORMAT
  3. OUTPUTFORMAT

You have defined only the last two, leaving the SERDE to be defined by hive.default.serde. That default, LazySimpleSerDe, expects text records (it casts each row to org.apache.hadoop.io.BinaryComparable), while the ORC reader produces OrcStruct objects, hence the ClassCastException you saw.

hive.default.serde
Default Value: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Added in: Hive 0.14 with HIVE-5976
The default SerDe Hive will use for storage formats that do not specify a SerDe.
Storage formats that currently do not specify a SerDe include 'TextFile, RcFile'.

Demo

hive.default.serde

set hive.default.serde;

hive.default.serde=org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

STORED AS ORC

create table mytable (i int) 
stored as orc;

show create table mytable;

Note that the SERDE is 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'

CREATE TABLE `mytable`(
  `i` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'file:/home/cloudera/local_db/mytable'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1496982059')

STORED AS INPUTFORMAT ... OUTPUTFORMAT ...

create table mytable2 (i int) 
STORED AS 
INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
;

show create table mytable2
;

Note that the SERDE is 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

CREATE TABLE `mytable2`(
  `i` int)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'file:/home/cloudera/local_db/mytable2'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='{\"BASIC_STATS\":\"true\"}', 
  'numFiles'='0', 
  'numRows'='0', 
  'rawDataSize'='0', 
  'totalSize'='0', 
  'transient_lastDdlTime'='1496982426')
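If you have already created a table like mytable2, one hedged way to repair it without recreating it should be to point it at the ORC SerDe explicitly (sketch only, using the demo table above):

alter table mytable2 set serde 'org.apache.hadoop.hive.ql.io.orc.OrcSerde';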
answered Oct 05 '22 by David דודו Markovitz


You can specify the INPUTFORMAT, OUTPUTFORMAT, and SERDE explicitly when creating a table. Hive allows you to separate your record format from your file format, and you can provide custom classes for the INPUTFORMAT, OUTPUTFORMAT, and SERDE. See details: http://www.dummies.com/programming/big-data/hadoop/defining-table-record-formats-in-hive/
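For example, a fully explicit ORC table could be declared like this (just a sketch; the table name and column are placeholders, and the classes are the same ones shown in the other answer):

create table my_orc_explicit (i int)
row format serde
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
stored as inputformat
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
outputformat
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';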

Alternatively, you can simply write STORED AS ORC or STORED AS TEXTFILE, for example. The STORED AS ORC clause already takes care of the INPUTFORMAT, OUTPUTFORMAT, and SERDE, so you do not have to write out those long fully qualified Java class names; just STORED AS ORC instead.

answered Oct 05 '22 by leftjoin