I am somewhat new to Apache Hadoop. I have seen this and this question about Hadoop, HBase, Pig, Hive and HDFS; both describe comparisons between the technologies above.

But I have seen that a Hadoop environment typically contains all of these components (HDFS, HBase, Pig, Hive, Azkaban).

Can someone explain the relationship of those components/technologies, with their responsibilities inside a Hadoop environment, in an architectural-workflow manner, preferably with an example?
Hive can store data in external tables, so it is not mandatory to use HDFS; it also supports file formats such as ORC, Avro, SequenceFile and text files. Hive is an SQL-based tool that builds on top of Hadoop to process data, since Hadoop itself only understands MapReduce.
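To make the external-table idea concrete, here is a minimal HiveQL sketch (the table name, columns and path are all hypothetical); because the table is EXTERNAL, dropping it removes only Hive's metadata, not the underlying files:

```sql
-- Hypothetical example: an external table over existing tab-separated text files.
-- DROP TABLE would delete only the metadata; the files stay where they are.
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/logs/page_views';
```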
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
Pig™: A high-level data-flow language and execution framework for parallel computation.
The difference between HBase and Hive: HDFS is the backbone of HBase, a column-oriented distributed database; it is an open-source NoSQL database with rows and columns. Apache Hive, by contrast, is a Hadoop-based open-source data warehouse.

Hadoop and HBase are both used to store massive amounts of data, but the difference is that in the Hadoop Distributed File System (HDFS), data is stored in a distributed manner across the different nodes of the network, whereas HBase is a database that stores data in the form of rows and columns in a table.
General Overview:
HDFS is Hadoop's Distributed File System. Intuitively you can think of this as a filesystem that spans many servers.
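For instance (assuming a running cluster with the `hdfs` client on your PATH), you work with it much like an ordinary filesystem, even though the data is split into blocks and replicated across DataNodes behind the scenes:

```
# Requires a live Hadoop cluster; shown here only as an illustration.
hdfs dfs -mkdir -p /user/alice/input    # create a directory in HDFS
hdfs dfs -put access.log /user/alice/input/   # copy a local file in
hdfs dfs -ls /user/alice/input          # list it, just like `ls`
```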
HBASE is a column-oriented datastore. It is modeled after Google's Bigtable, but if that's not something you knew about, then think of it as a non-relational database that provides real-time read/write access to data. It is integrated into Hadoop.
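A sketch of what that real-time access looks like from the interactive `hbase shell` (the table, column family and values are made up for illustration):

```
# From `hbase shell`: create a table with one column family,
# write a single cell, then read the row back immediately.
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Alice'
get 'users', 'row1'
```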
Pig and Hive are ways of querying data in the Hadoop ecosystem. The main difference is that Hive's query language is much closer to SQL, while Pig uses what is called Pig Latin.
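To make the contrast concrete, here is a hypothetical per-URL count written as a Pig Latin data flow (the input path and field names are invented); where Hive would express it as a single SQL-like statement, Pig spells out each step:

```pig
-- Roughly the Pig Latin equivalent of the HiveQL statement:
--   SELECT url, COUNT(*) FROM page_views GROUP BY url;
views   = LOAD '/data/logs/page_views'
          AS (user_id:chararray, url:chararray);
grouped = GROUP views BY url;               -- one group per distinct url
counts  = FOREACH grouped
          GENERATE group AS url, COUNT(views) AS n;
DUMP counts;                                -- print results to the console
```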
Azkaban is a prison, I mean batch workflow job scheduler. So basically it's similar to Oozie, in that you can run map/reduce, Pig, Hive, bash, etc. steps together as a single job.
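For example, a classic Azkaban flow is defined with small `.job` property files (the names and commands here are invented); `bar` declares a dependency on `foo`, so it only runs after `foo` succeeds:

```
# foo.job -- hypothetical first step
type=command
command=hdfs dfs -ls /data/logs

# bar.job -- runs only after foo finishes successfully
type=command
command=echo "downstream step"
dependencies=foo
```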
At the highest level possible, you can think of HDFS being your filesystem with HBASE as the datastore. Pig and Hive would be your means of querying from your datastore. Then Azkaban would be your way of scheduling jobs.
Stretched Example:
Suppose you're familiar with Linux ext3 or ext4 as a filesystem, MySQL/PostgreSQL/MariaDB/etc. as a database, SQL to access the data, and cron to schedule jobs. (On Windows, you can swap ext3/ext4 for NTFS and cron for Task Scheduler.)
HDFS takes the place of ext3 or ext4 (and is distributed), HBASE takes the database role (and is non-relational!), Pig/Hive are a way to access the data, and Azkaban is a way to schedule jobs.
NOTE: This is not an apples to apples comparison. It's just to demonstrate that the Hadoop components are an abstraction meant to give you a workflow that you're already probably familiar with.
I highly encourage you to look into the components further, as you'll have a good amount of fun. Hadoop has so many interchangeable components (YARN, Kafka, Oozie, Ambari, ZooKeeper, Sqoop, Spark, etc.) that you'll be asking this question of yourself a lot.
EDIT: The links you posted went more into detail about HBase and Hive/Pig so I tried to give an intuitive picture of how they all fit together.
A Hadoop environment contains all these components (HDFS, HBase, Pig, Hive, Azkaban). A short description of each:

HDFS - the storage layer of the Hadoop framework.

HBase - a columnar database, where you store data in the form of columns for faster access. Yes, it does use HDFS as its storage.

Pig - a data-flow language; its community has provided built-in functions to load and process semi-structured data like JSON and XML, along with structured data.

Hive - a querying language for running queries over tables; mounting a table over the HDFS data is necessary before you can play with it.

Azkaban - if you have a pipeline of Hadoop jobs, you can schedule them to run at specific times, and after or before some dependency.