I have a hive table like
CREATE TABLE beacons
(
foo string,
bar string,
foonotbar string
)
COMMENT "Digest of daily beacons, by day"
PARTITIONED BY ( day string COMMENT "In YYYY-MM-DD format" );
To populate, I am doing something like:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" ) SELECT
someFunc(query, "foo") as foo,
someFunc(query, "bar") as bar,
otherFunc(query, "foo||bar") as foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
This builds a new partition with the individual output files compressed with DEFLATE, but ideally they would be compressed with the LZO codec instead.
Unfortunately I am not exactly sure how to accomplish that, but I assume it is either one of the many runtime settings or perhaps just an additional line in the CREATE TABLE DDL.
INSERT OVERWRITE DIRECTORY with Hive format overwrites the existing data in the directory using a Hive SerDe; Hive support must be enabled to use that command, and the inserted rows can come from value expressions or from a query. More generally, INSERT OVERWRITE replaces any existing data in the target table or partition with the new rows. So yes: it overwrites the data.
Before the INSERT OVERWRITE, set the following runtime configuration values:
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
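Putting it together with the INSERT from the question, the full flow might look like this (a sketch: the UDF names someFunc/otherFunc come from the question, and com.hadoop.compression.lzo.LzopCodec must be installed on the cluster and present on its classpath):

```sql
-- Enable compressed output and select the LZO codec for this session
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

-- Rebuild the partition; its output files should now be LZO-compressed
INSERT OVERWRITE TABLE beacons PARTITION ( day = "2011-01-26" )
SELECT
  someFunc(query, "foo")        AS foo,
  someFunc(query, "bar")        AS bar,
  otherFunc(query, "foo||bar")  AS foonotbar
FROM raw_logs
WHERE day = "2011-01-26";
```

Because these are session-level SET statements, no change to the CREATE TABLE DDL is needed; subsequent sessions without the SETs will fall back to the cluster defaults.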
Also make sure you have the desired compression codec by checking:
io.compression.codecs
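In the Hive CLI, issuing SET with just a property name prints its current value, so you can verify the LZO classes are registered before running the job:

```sql
-- Prints the configured codec list; the LZO codec classes should appear in it
SET io.compression.codecs;
```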
Further information about io.seqfile.compression.type can be found at http://wiki.apache.org/hadoop/Hive/CompressedStorage
I may be mistaken, but it seemed like BLOCK type produces larger files compressed at a higher ratio, versus a larger number of smaller, less-compressed files.
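For reference, io.seqfile.compression.type comes from Hadoop's SequenceFile settings and accepts three values (the descriptions below are my understanding, not from the original post):

```
NONE    -- no compression of values
RECORD  -- each record's value is compressed individually
BLOCK   -- groups of records are compressed together, usually giving the best ratio
```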