
How does Hadoop get input data not stored on HDFS?

I'm trying to wrap my brain around Hadoop: I've read this excellent tutorial and perused the official Hadoop docs. However, in none of this literature can I find a simple explanation for something pretty rudimentary:

In all the contrived "Hello World!" (word count) introductory MR examples, the input data is stored directly in text files. However, it feels like this would seldom be the case out in the real world. I would imagine that in reality, the input data would live in large data stores, like a relational DB, Mongo, or Cassandra, or only be available via a REST API, etc.

So I ask: in the real world, how does Hadoop get its input data? I do see that there are projects like Sqoop and Flume, and I'm wondering whether the whole point of these frameworks is simply to ETL input data onto HDFS for running MR jobs.

asked Jun 25 '15 by smeeb


1 Answer

Actually, HDFS is needed in real-world applications for several reasons:

  • Very high bandwidth to support MapReduce workloads, and scalability.
  • Data reliability and fault tolerance, thanks to replication and its distributed nature; this is required for critical data systems.
  • Flexibility: you don't have to pre-process the data before storing it in HDFS.

Hadoop is designed around a write-once, read-many model. Kafka, Flume and Sqoop, which are generally used for ingestion, are themselves very fault tolerant and provide high bandwidth for moving data into HDFS. Sometimes data has to be ingested from thousands of sources per minute, with volumes in the gigabytes; for that you need these tools as well as a fault-tolerant storage system, i.e. HDFS.
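
That said, a MapReduce job is not strictly limited to files already sitting on HDFS: Hadoop also ships input formats such as DBInputFormat (in org.apache.hadoop.mapreduce.lib.db) that read rows directly from a relational database over JDBC. The sketch below shows that approach for a small table; the JDBC URL, credentials and the users(id, name) table are illustrative placeholders, not anything from the question.

    // Minimal sketch: reading rows straight out of a relational database as
    // MapReduce input via DBInputFormat, then writing them to HDFS as text.
    // The JDBC URL, credentials and the users(id, name) table are made-up placeholders.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
    import org.apache.hadoop.mapreduce.lib.db.DBWritable;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DbToHdfs {

      // One row of the (hypothetical) users table.
      public static class UserRecord implements Writable, DBWritable {
        long id;
        String name;

        public void readFields(ResultSet rs) throws SQLException {
          id = rs.getLong("id");
          name = rs.getString("name");
        }
        public void write(PreparedStatement ps) throws SQLException {
          ps.setLong(1, id);
          ps.setString(2, name);
        }
        public void readFields(DataInput in) throws IOException {
          id = in.readLong();
          name = in.readUTF();
        }
        public void write(DataOutput out) throws IOException {
          out.writeLong(id);
          out.writeUTF(name);
        }
      }

      // Turn each database row into one tab-separated line; no reducer is needed.
      public static class DumpMapper
          extends Mapper<LongWritable, UserRecord, Text, NullWritable> {
        protected void map(LongWritable key, UserRecord row, Context ctx)
            throws IOException, InterruptedException {
          ctx.write(new Text(row.id + "\t" + row.name), NullWritable.get());
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the job at the database instead of at files already on HDFS.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
            "jdbc:mysql://dbhost:3306/appdb", "etl_user", "secret");

        Job job = Job.getInstance(conf, "db-to-hdfs");
        job.setJarByClass(DbToHdfs.class);
        job.setMapperClass(DumpMapper.class);
        job.setNumReduceTasks(0);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Read the users table, ordered by id so the splits are deterministic.
        DBInputFormat.setInput(job, UserRecord.class,
            "users", null /* conditions */, "id" /* orderBy */, "id", "name");

        // args[0] is the HDFS directory where the extracted rows land.
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

In practice this only makes sense for modest amounts of data, because every map task opens its own database connection; for large or continuous volumes, staging the data into HDFS with Sqoop, Flume or Kafka, as described above, is the usual pattern.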

answered Sep 29 '22 by Anshul Joshi