Specifying compression codec for an INSERT OVERWRITE SELECT in Hive

I have a Hive table like:

 CREATE TABLE beacons
 (
     foo string,
     bar string,
     foonotbar string
 )
 COMMENT "Digest of daily beacons, by day"
 PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );

To populate, I am doing something like:

 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
 SELECT
   someFunc(query, "foo") as foo,
   someFunc(query, "bar") as bar,
   otherFunc(query, "foo||bar") as foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";

This builds a new partition whose output files are compressed with the default deflate codec, but the ideal here would be to go through the LZO compression codec instead.

Unfortunately I am not exactly sure how to accomplish that, but I assume it's one of the many runtime settings or perhaps just an additional line in the CREATE TABLE DDL.
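
One way to confirm which codec is currently in effect is to list the partition's files from the Hive CLI; the path below is an assumption (the default Hive warehouse location):

 -- List the partition's output files from the Hive CLI
 dfs -ls /user/hive/warehouse/beacons/day=2011-01-26;
 -- Files named like 000000_0.deflate indicate the default DeflateCodec is being applied.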

asked Jan 28 '11 by David


1 Answer

Before the INSERT OVERWRITE, set the following runtime configuration values:

SET hive.exec.compress.output=true; 
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
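
For example, applied to the INSERT from the question (someFunc and otherFunc are the question's placeholder UDFs), the full sequence would look something like:

 SET hive.exec.compress.output=true;
 SET io.seqfile.compression.type=BLOCK;
 SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

 INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
 SELECT
   someFunc(query, "foo") as foo,
   someFunc(query, "bar") as bar,
   otherFunc(query, "foo||bar") as foonotbar
 FROM raw_logs
 WHERE day = "2011-01-26";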

Also make sure you have the desired compression codec by checking:

io.compression.codecs
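
In the Hive CLI, issuing SET with just a property name prints its current value, so the registered codecs can be inspected with something like this (the exact list varies by cluster):

 SET io.compression.codecs;
 -- com.hadoop.compression.lzo.LzopCodec should appear in the printed list
 -- if the hadoop-lzo libraries are installed and registered.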

Further information about io.seqfile.compression.type can be found here: http://wiki.apache.org/hadoop/Hive/CompressedStorage

I may be mistaken, but it seemed like the BLOCK type would ensure larger chunks are compressed together at a higher ratio, versus a larger number of smaller, less-compressed pieces.
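
For reference, the valid values of io.seqfile.compression.type are the standard SequenceFile compression types:

 SET io.seqfile.compression.type=NONE;    -- no compression of record values
 SET io.seqfile.compression.type=RECORD;  -- compress each record's value individually
 SET io.seqfile.compression.type=BLOCK;   -- compress batches of records together (typically the best ratio)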

answered Sep 21 '22 by David