I'm trying to wrap my brain around Hadoop. I've read this excellent tutorial and perused the official Hadoop docs, but nowhere in this literature can I find a simple explanation for something pretty rudimentary:
In all the contrived "Hello World!" (word count) introductory MR examples, the input data is stored directly in text files. However, it seems to me this would seldom be the case in the real world. I would imagine that in practice the input data lives in large data stores, such as a relational DB, Mongo, or Cassandra, or is only available via a REST API, etc.
So I ask: In the real world, how does Hadoop get its input data? I do see that there are projects like Sqoop and Flume and am wondering if the whole point of these frameworks is to simply ETL input data onto HDFS for running MR jobs.
HDFS is actually needed in real-world applications for several reasons.
Hadoop is designed around a write-once, read-many model. Kafka, Flume, and Sqoop, which are commonly used for ingestion, are themselves fault tolerant and provide high-bandwidth paths for loading data into HDFS. It is often necessary to ingest gigabytes of data per minute from thousands of sources, and that workload calls for these ingestion tools together with a fault-tolerant storage system like HDFS. So yes, in practice these frameworks largely exist to ETL data from external systems onto HDFS, which then serves as the input for MR jobs.
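As a minimal sketch of that ingestion step, here is what a typical Sqoop import might look like; the connection string, credentials, table name, and target directory below are placeholders, not anything from your environment:

```sh
# Pull the "orders" table from a relational DB into HDFS as files
# that a MapReduce job can later read as its input.
sqoop import \
  --connect jdbc:mysql://db.example.com:3306/sales \
  --username etl_user \
  -P \
  --table orders \
  --target-dir /data/raw/orders \
  --num-mappers 4
```

Once the import finishes, the files under /data/raw/orders can be used directly as the input path of an MR job; Flume and Kafka play a similar role for streaming sources such as log events.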