I'm new to Hadoop. I know that HCatalog is a table and storage management layer for Hadoop. But how exactly does it work, and how do I use it? Please give a simple example.
HCatalog is a tool that allows you to access Hive metastore tables from Pig, Spark SQL, and/or custom MapReduce applications. HCatalog has a REST interface and a command-line client that allow you to create tables or perform other operations. You then write your applications to access the tables using the HCatalog libraries.
By default, HCatalog supports RCFile, CSV, JSON, SequenceFile, and ORC file formats. To use a custom format, you must provide the InputFormat, OutputFormat, and SerDe.
The metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database. It provides client access to this information through the metastore service API.
Applications of HCatalog
Hive has been the de facto SQL interface for Hadoop since 2008 because it offers a relational view, through a SQL-like language, of data within Hadoop. HCatalog now publishes this same interface and abstracts it for data beyond Hive.
In short, HCatalog opens up the Hive metadata to other MapReduce tools. Every MapReduce tool has its own notion of HDFS data (for example, Pig sees HDFS data as a set of files, while Hive sees it as tables). With a table-based abstraction, tools that support HCatalog do not need to care where the data is stored, in which format, or in which storage system (HBase or HDFS).
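To make the table abstraction concrete, here is a toy Python model of the idea. It is not the HCatalog API: the METASTORE dictionary, the TableInfo class, the column names, and the HDFS path are all made up for illustration. The point is only that the caller passes a table name and never a file path or format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TableInfo:
    """What a metastore records about one table (simplified)."""
    columns: tuple        # (column name, column type) pairs
    location: str         # where the data physically lives
    storage_format: str   # how the data is encoded on disk


# Hypothetical stand-in for the Hive metastore; every value here is assumed.
METASTORE = {
    "student": TableInfo(
        columns=(("name", "string"), ("marks", "int")),
        location="hdfs:///user/hive/warehouse/student",
        storage_format="TEXTFILE",
    ),
}


def describe(table_name):
    """Return the schema of a table, looked up by name only."""
    info = METASTORE[table_name]
    return [f"{name}: {col_type}" for name, col_type in info.columns]


print(describe("student"))
```

A tool built this way keeps working if the table is moved or converted to another format, because only the metastore entry changes.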
If you configure WebHCat alongside HCatalog, you also get the ability to submit jobs in a RESTful way.
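As a rough sketch of what a WebHCat call looks like, the Python below builds the URL that asks WebHCat to describe a table. It assumes WebHCat is listening on its default port 50111 and that the table lives in the given database; the host name and user name are placeholders you would replace with your own.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhcat_table_url(host, database, table, user, port=50111):
    """Build the WebHCat DDL URL that describes one table."""
    path = f"/templeton/v1/ddl/database/{quote(database)}/table/{quote(table)}"
    query = urlencode({"user.name": user})
    return f"http://{host}:{port}{path}?{query}"


def describe_table(host, database, table, user):
    """Send the GET request and decode the JSON reply (needs a live server)."""
    with urlopen(webhcat_table_url(host, database, table, user)) as response:
        return json.load(response)


# Only the URL is built here; no network call is made.
print(webhcat_table_url("localhost", "default", "student", "hadoop"))
```

Calling describe_table only makes sense against a running WebHCat server, so the sketch stops at printing the URL.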
Here is a very basic example of how to use HCatalog.
I have a table in Hive named STUDENT, which is stored in an HDFS location:
neethu 90
malini 90
sunitha 98
mrinal 56
ravi 90
joshua 8
Now suppose I want to load this table into Pig for further transformation of the data. In this scenario I can use HCatalog:
When using table information from the Hive metastore with Pig, add the -useHCatalog option when invoking pig:
pig -useHCatalog
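(You may need to set HCAT_HOME first: export HCAT_HOME=/usr/lib/hive-hcatalog/)
Now load the table into Pig:
A = LOAD 'student' USING org.apache.hcatalog.pig.HCatLoader();
(In newer Hive releases the loader class is org.apache.hive.hcatalog.pig.HCatLoader; org.apache.hcatalog is the older package name.)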
The table is now loaded into Pig. To check the schema, just run DESCRIBE on the relation:
DESCRIBE A;
Thanks