 

Relationship between Hadoop and databases

OK... I have tried searching the web and this site for an answer to this question, which seems very basic. I am a complete noob to big data processing.

I want to know the relationship between HDFS and databases. Is it always necessary that, to use HDFS, the data be in some NoSQL format? Is there a specific database that always comes attached when using HDFS? I know Cloudera offers Hadoop solutions and they use HBase.

Can I use a relational database as the native database for Hadoop?

crossvalidator asked Jul 03 '13 21:07


People also ask

Can Hadoop be used as a database?

Is Hadoop a Database? Hadoop is not a database, but rather an open-source software framework specifically built to handle large volumes of structured and semi-structured data.

How is Hadoop related to big data in DBMS?

Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large computer to store and process the data, Hadoop allows clustering multiple computers to analyze massive datasets in parallel more quickly.

Is Hadoop a relational database?

Unlike Relational Database Management System (RDBMS), we cannot call Hadoop a database, but it is more of a distributed file system that can store and process a huge volume of data sets across a cluster of computers. Hadoop has two major components: HDFS (Hadoop Distributed File System) and MapReduce.


1 Answer

I want to know the relationship between HDFS and databases.

There is no direct relation between the two. If you still want to find a similarity, the only thing common between them is that both store data. But this is analogous to any FS and DB combination; MySQL and ext3, for example. You say that you are storing data in MySQL, but eventually your data is stored on top of your FS. Usually folks use NoSQL databases, like HBase, on top of their Hadoop cluster to exploit the parallelism and distributed behavior provided by HDFS.

Is it always necessary that, to use HDFS, the data be in some NoSQL format?

There is actually no such thing as a "NoSQL format". You can use HDFS for any kind of data: text, binary, XML, etc.
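To illustrate, a quick sketch with the HDFS FileSystem Shell; the file names and paths here are made up, but the `hdfs dfs` subcommands are real. HDFS treats every file as an opaque sequence of bytes:

```shell
# HDFS does not care about the format: text, XML, and binary alike
hdfs dfs -mkdir -p /user/demo
hdfs dfs -put access.log /user/demo/     # plain text
hdfs dfs -put events.xml /user/demo/     # XML
hdfs dfs -put image.png  /user/demo/     # binary
hdfs dfs -ls /user/demo
```

These commands of course require a running Hadoop cluster; interpreting the bytes (as CSV, Avro, images, ...) is up to whatever job reads them.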

Is there a specific database that always comes attached when using HDFS?

No. The only thing that comes coupled with HDFS is the MapReduce framework. You can obviously make a DB work with HDFS. Folks often use NoSQL DBs on top of HDFS; there are several choices, like Cassandra, HBase, etc. It's totally up to you to decide which one to use.
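Since MapReduce is the part that does come bundled, here is a minimal sketch of its model: the classic word count, written in the style of a Hadoop Streaming mapper and reducer. The sample input is invented, and the `sorted()` call stands in for the shuffle/sort phase that a real cluster performs between the two stages:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word seen."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word.
    Input must be sorted by key, which Hadoop's shuffle guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["hadoop stores data", "hadoop processes data"]
    shuffled = sorted(mapper(sample))   # simulate the shuffle/sort step
    print(dict(reducer(shuffled)))      # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```

On a real cluster you would package the same two functions as scripts and run them with the Hadoop Streaming jar, reading from and writing to HDFS instead of an in-memory list.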

Can I use a relational database as the native database for Hadoop?

There is no OOTB feature which allows this. Moreover, it doesn't make much sense to use RDBMSs with Hadoop. Hadoop was developed for the cases where an RDBMS is not the suitable option, like handling PBs of data or handling unstructured data. Having said that, you must not think of Hadoop as a replacement for RDBMSs. Both have entirely different goals.

EDIT :

Normally folks use NoSQL DBs (like HBase, Cassandra) with Hadoop. Using these DBs with Hadoop is merely a matter of configuration; you don't need any connecting program to achieve this. Apart from the point made by @Doctor Dan, there are a few other reasons for choosing NoSQL DBs over SQL DBs. One is size: these NoSQL DBs provide great horizontal scalability, which enables you to store PBs of data easily. You could scale traditional systems, but only vertically. Another reason is the complexity of the data. The places where these DBs are used mostly handle highly unstructured data, which is not very easy to deal with using traditional systems. For example, sensor data, log data, etc.
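To make the HBase case concrete, a short HBase shell session; the table, column family, and row key below are invented for illustration, but the `create`/`put`/`get` commands are the shell's real verbs. HBase persists everything as files on HDFS underneath:

```
create 'sensor_data', 'readings'
put 'sensor_data', 'device42#2013-07-03', 'readings:temp', '21.5'
get 'sensor_data', 'device42#2013-07-03'
```

Note the row key design (device id plus timestamp): with unstructured or semi-structured data like sensor readings, you model around access patterns rather than around a fixed schema.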

Basically, I did not understand why SQOOP exists. Why can't we directly use SQL data on Hadoop?

Although Hadoop is very good at handling your BigData needs, it is not the solution to all of them. In particular, it is not suitable for real-time needs. Suppose you are an online transaction company with a very, very huge dataset. You find that you could process this data very easily using Hadoop, but the problem is that you can't serve the real-time needs of your customers with Hadoop. This is where SQOOP comes into the picture. It is an import/export tool that allows you to move data between a SQL DB and Hadoop. You could move your BigData into your Hadoop cluster, process it there, and then push the results back into your SQL DB using SQOOP to serve the real-time needs of your customers.
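That round trip can be sketched with two Sqoop invocations. The JDBC URL, credentials, table names, and HDFS paths below are hypothetical; the flags themselves (`--connect`, `--table`, `--target-dir`, `--export-dir`) are Sqoop's standard ones:

```shell
# Pull the big table out of MySQL into HDFS for batch processing
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.pw \
  --table orders \
  --target-dir /user/etl/orders

# ...run your MapReduce jobs over /user/etl/orders, then
# push the (much smaller) results back for real-time serving:
sqoop export \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.pw \
  --table order_stats \
  --export-dir /user/etl/order_stats_out
```

The pattern to notice: the raw data and the heavy processing live on Hadoop, while the SQL DB only holds the summarized results your application queries interactively.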

HTH

Tariq answered Sep 18 '22 15:09