I am new to Hadoop and HDFS, and it confuses me why HDFS is not preferred for applications that require low latency. In a big data scenario, the data is spread across commodity hardware on many machines, so accessing the data should be faster.
Hadoop is fundamentally a batch-processing system designed to store and analyze structured, semi-structured, and unstructured data.
The MapReduce framework of Hadoop is relatively slow because it is designed to support varied formats and structures over huge volumes of data.
We should not say HDFS itself is slow, since the HBase NoSQL database and MPP-style query engines such as Impala and HAWQ sit on top of HDFS. These data sources are faster because they do not go through MapReduce execution for data retrieval and processing.
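For instance, a keyed read from HBase is a direct lookup against a region server and never launches a MapReduce job. Here is a minimal sketch using the standard HBase Java client; the `users` table, the `profile:name` column, and the row key are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseRandomRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "users" is a hypothetical table for illustration.
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // Single keyed lookup: goes straight to the region server
            // holding this row; no MapReduce job is involved.
            Get get = new Get(Bytes.toBytes("user#42"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("profile"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```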
The slowness comes from the nature of MapReduce-based execution: it produces a lot of intermediate data, much of which is exchanged between nodes, causing heavy disk I/O latency. Furthermore, it has to persist much of that data to disk to synchronize between phases and to support job recovery from failures. Also, MapReduce provides no way to cache all or a subset of the data in memory.
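To make those phases concrete, here is the canonical WordCount job written against the standard Hadoop MapReduce API, with comments marking where intermediate data hits disk; input and output paths are whatever you pass on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        // Map output is buffered, spilled to the mapper's local disk,
        // then shuffled over the network to reducers -- the disk and
        // network cost described above.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        // The reduce phase starts only after the shuffled intermediate
        // data has been fetched and merged, again via disk.
        public void reduce(Text key, Iterable<IntWritable> values,
                           Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```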
Apache Spark is yet another batch-processing system, but it is considerably faster than Hadoop MapReduce because it caches much of the input data in memory as RDDs and keeps intermediate data in memory as well, writing to disk only on completion or when required.
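A minimal sketch of that caching behavior using Spark's Java API follows; the HDFS path and the filter conditions are hypothetical. The first action reads from HDFS and fills the cache; the second is served from executor memory:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("CacheExample");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input path, for illustration only.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events.log");

        // cache() keeps the RDD partitions in executor memory, so
        // subsequent actions reuse them instead of re-reading HDFS.
        JavaRDD<String> errors =
            lines.filter(l -> l.contains("ERROR")).cache();

        long total = errors.count();    // first action: reads HDFS, fills cache
        long timeouts = errors
            .filter(l -> l.contains("timeout"))
            .count();                   // second action: served from memory

        System.out.println(total + " errors, " + timeouts + " timeouts");
        sc.stop();
    }
}
```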