Efficiently Storing Data in Hive

How can I efficiently store data in Hive, and also store and retrieve compressed data in Hive? Currently I am storing it as a TextFile. I was going through an article by Bejoy, and I found that LZO compression is good for storing the files and is also splittable.
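For reference, my understanding is that compressing the text output comes down to a few session settings along these lines (a sketch only; the LzopCodec class name comes from the hadoop-lzo project and is an assumption on my part, and plain .lzo text files only become splittable after an index has been built for them):

set hive.exec.compress.output=true;    -- assumption: Hive-side switch for compressed query output
set mapred.output.compress=true;
set mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;    -- assumes hadoop-lzo is installed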

I have one HiveQL SELECT query that generates some output, and I am storing that output somewhere so that one of my Hive tables (quality) can use that data, and I can then query that quality table.

Below is the quality table, into which I load the data from the SELECT query below, overwriting one partition at a time.

create table quality
(id bigint,
  total bigint,
  error bigint
 )
partitioned by (ds string)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/uname/quality'
;

insert overwrite table quality partition (ds='20120709')
SELECT id, count2, coalesce(error, cast(0 AS BIGINT)) AS count1 FROM Table1;

So, currently I am storing the output as a TextFile. Should I make this a SequenceFile and store the data in LZO compression format, or will a text file be fine here as well? The SELECT query produces a few GB of data that needs to be loaded into the quality table on a daily basis.

So which way is best? Should I store the output in TextFile or SequenceFile format (with LZO compression), so that querying the Hive quality table is faster?

Update:

What if I store it as a SequenceFile with BLOCK compression, like below?

set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

Do I need to set anything else to enable BLOCK compression, apart from the above? I am also creating the table in SequenceFile format.

Update Again

Should I create the table like the one below, or do some other changes need to be made to enable BLOCK compression with SequenceFile?

create table lipy
( buyer_id bigint,
  total_chkout bigint,
  total_errpds bigint
 )
 partitioned by (dt string)
row format delimited fields terminated by '\t'
stored as sequencefile
location '/apps/hdmi-technology/lipy'
;
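Put together, the full sequence I have in mind looks like this (a sketch only; hive.exec.compress.output is my assumption for the Hive-side switch, and source_table is just a placeholder for whatever feeds lipy):

set hive.exec.compress.output=true;    -- assumption: makes Hive compress the final query output
set mapred.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec;

-- source_table is a placeholder, not a real table from the question
insert overwrite table lipy partition (dt='20120709')
select buyer_id, total_chkout, total_errpds from source_table;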

1 Answer

I have not used Hive much, but from experience with Hadoop and structured data, I got the best performance from SequenceFiles with BLOCK compression. The default is record (row) compression, which is not as efficient as BLOCK compression when you store structured data and the rows are not particularly big. To switch it on, I used mapred.output.compression.type=BLOCK.
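A quick way to sanity-check that it took effect is to look at the partition's files from the Hive CLI (a sketch; the path is the partition location from the question):

-- file sizes in the partition should shrink noticeably once BLOCK compression is on
dfs -ls /user/uname/quality/ds=20120709;

If I remember right, a SequenceFile's header also spells out the compression type and codec class near the start of the file, so `hadoop fs -cat <file> | head` from an ordinary shell will show which codec was actually used.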
