I am somewhat new to Apache Hadoop. I have seen this and this question about Hadoop, HBase, Pig, Hive and HDFS; both describe comparisons between the technologies above.

But I have seen that a Hadoop environment typically contains all of these components (HDFS, HBase, Pig, Hive, Azkaban).

Can someone explain the relationship of those components/technologies, with their responsibilities inside a Hadoop environment, in an architectural-workflow manner, preferably with an example?
Hive can store data in external tables, so it is not mandatory to use HDFS; it also supports file formats such as ORC, Avro, SequenceFile and text files. Hive is an SQL-based tool that builds on top of Hadoop to process data, since Hadoop itself only understands MapReduce.
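To make the external-table idea concrete, here is a minimal HiveQL sketch (the table name, columns and path are all hypothetical); because the table is EXTERNAL, dropping it removes only Hive's metadata, not the underlying files:

```sql
-- Hypothetical example: an external table over existing tab-separated text files.
-- DROP TABLE would delete only the metadata; the files stay where they are.
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/page_views';
```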
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
The difference between HBase and Hive: HDFS is the backbone of HBase, a column-oriented distributed database; it is an open-source NoSQL database with rows and columns. Apache Hive, by contrast, is a Hadoop-based open-source data warehouse.

Hadoop and HBase are both used to store massive amounts of data, but the difference is that in the Hadoop Distributed File System (HDFS), data is stored in a distributed manner across the different nodes of the network, whereas HBase is a database that stores data in the form of rows and columns in a table.
General Overview:
HDFS is Hadoop's Distributed File System. Intuitively you can think of this as a filesystem that spans many servers.
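For instance (assuming a running cluster with the `hdfs` client on your PATH), you work with it much like an ordinary filesystem, even though the data is split into blocks and replicated across DataNodes behind the scenes:

```
# Requires a live Hadoop cluster; shown here only as an illustration.
hdfs dfs -mkdir -p /user/alice/input    # create a directory in HDFS
hdfs dfs -put access.log /user/alice/input/   # copy a local file in
hdfs dfs -ls /user/alice/input          # list it, just like `ls`
```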
HBASE is a column-oriented datastore. It is modeled after Google's Bigtable, but if that's not something you knew about, then think of it as a non-relational database that provides real-time read/write access to data. It is integrated into Hadoop.
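A sketch of what that real-time access looks like from the interactive `hbase shell` (the table, column family and values are made up for illustration):

```
# From `hbase shell`: create a table with one column family,
# write a single cell, then read the row back immediately.
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
get 'users', 'row1'
```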
Pig and Hive are ways of querying data in the Hadoop ecosystem. The main difference is that Hive's query language is much closer to SQL, while Pig uses what is called Pig Latin.
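To make the contrast concrete, here is a hypothetical per-URL count written as a Pig Latin data flow (the input path and field names are invented); where Hive would express it as a single SQL-like statement, Pig spells out each step:

```pig
-- Roughly the Pig Latin equivalent of the HiveQL statement:
--   SELECT url, COUNT(*) FROM page_views GROUP BY url;
views   = LOAD '/data/logs/page_views'
          AS (user_id:chararray, url:chararray);
grouped = GROUP views BY url;               -- one group per distinct url
counts  = FOREACH grouped
          GENERATE group AS url, COUNT(views) AS n;
DUMP counts;                                -- print results to the console
```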
Azkaban is a prison, I mean batch workflow job scheduler. So basically it's similar to Oozie, in that you can run map/reduce, Pig, Hive, bash, etc. steps together as a single job.
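For example, a classic Azkaban flow is defined with small `.job` property files (the names and commands here are invented); `bar` declares a dependency on `foo`, so it only runs after `foo` succeeds:

```
# foo.job -- hypothetical first step
type=command
command=hdfs dfs -ls /data/logs

# bar.job -- runs only after foo finishes successfully
type=command
command=echo "downstream step"
dependencies=foo
```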
At the highest level possible, you can think of HDFS being your filesystem with HBASE as the datastore. Pig and Hive would be your means of querying from your datastore. Then Azkaban would be your way of scheduling jobs.
Stretched Example:
Suppose you're familiar with Linux ext3 or ext4 as a filesystem, MySQL/PostgreSQL/MariaDB/etc. as a database, SQL to access the data, and cron to schedule jobs. (On Windows, you can swap ext3/ext4 for NTFS and cron for Task Scheduler.)
HDFS takes the place of ext3 or ext4 (and is distributed), HBASE takes the database role (and is non-relational!), Pig/Hive are a way to access the data, and Azkaban is a way to schedule jobs.
NOTE: This is not an apples to apples comparison. It's just to demonstrate that the Hadoop components are an abstraction meant to give you a workflow that you're already probably familiar with.
I highly encourage you to look into the components further, as you'll have a good amount of fun. Hadoop has so many interchangeable components (YARN, Kafka, Oozie, Ambari, ZooKeeper, Sqoop, Spark, etc.) that you'll be asking this question of yourself a lot.
EDIT: The links you posted went more into detail about HBase and Hive/Pig so I tried to give an intuitive picture of how they all fit together.
A Hadoop environment contains all these components (HDFS, HBase, Pig, Hive, Azkaban). A short description of each:

HDFS - the storage layer of the Hadoop framework.

HBase - a columnar database, where you store data in the form of columns for faster access. Yes, it does use HDFS as its storage.

Pig - a data-flow language; its community has provided built-in functions to load and process semi-structured data like JSON and XML, along with structured data.

Hive - a querying language for running queries over tables; mounting a table over the HDFS data is necessary before you can play with it.

Azkaban - if you have a pipeline of Hadoop jobs, you can schedule them to run at specific times, and after or before some dependency.