 

Why is HDFS not preferred for applications that require low latency?

I am new to Hadoop and HDFS, and it confuses me why HDFS is not preferred for applications that require low latency. In a big data scenario, the data is spread across commodity hardware, so accessing it should be faster.

Nidhi asked Sep 29 '22
1 Answer

Hadoop is fundamentally a batch processing system, designed to store and analyze structured, semi-structured, and unstructured data.

Hadoop's MapReduce framework is relatively slow because it is designed to handle huge volumes of data in diverse formats and structures, not to serve individual requests quickly.

We should not say HDFS itself is slow: the HBase NoSQL database and MPP-based engines like Impala and HAWQ sit on top of HDFS. These engines respond faster because they do not go through MapReduce execution for data retrieval and processing.

The slowness comes from the nature of MapReduce-based execution: it produces a lot of intermediate data, exchanges much of that data between nodes, and so incurs heavy disk I/O latency. Furthermore, it has to persist data to disk between phases, both to synchronize the phases and to support job recovery from failures. MapReduce also has no way to cache all, or a subset, of the data in memory.
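The phase boundaries described above can be seen in a toy word count. A minimal pure-Python sketch (illustrative only, not Hadoop itself; the function names are made up for this example) showing where intermediate data is materialized between map, shuffle, and reduce:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit (word, 1) for every word; in Hadoop these intermediate
    # records are spilled to local disk before the shuffle.
    return [(word, 1) for line in lines for word in line.split()]

def shuffle_phase(pairs):
    # Group all values by key; in Hadoop this is where intermediate
    # data is sorted and transferred across the network between nodes.
    pairs = sorted(pairs, key=itemgetter(0))
    return {key: [v for _, v in grp]
            for key, grp in groupby(pairs, key=itemgetter(0))}

def reduce_phase(grouped):
    # Sum the counts per word; the final result is written back to storage.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data big cluster", "data lake"]
intermediate = map_phase(lines)   # fully materialized between phases
result = reduce_phase(shuffle_phase(intermediate))
print(result)  # {'big': 2, 'cluster': 1, 'data': 2, 'lake': 1}
```

Note that the full `intermediate` list exists before reduce can start; on a real cluster that materialization happens on disk and over the network, which is exactly the latency the answer describes.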

Apache Spark is also a batch processing system, but it is considerably faster than Hadoop MapReduce because it caches much of the input data in memory as RDDs and keeps intermediate data in memory as well, writing to disk only on completion or when required.
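The effect of caching can be illustrated without Spark at all. Below is a tiny pure-Python sketch (illustrative only, not the Spark API) in which a dataset is either re-read from its source on every pass, MapReduce-style, or read once and kept in memory and reused, RDD-style:

```python
class Source:
    """Simulates slow storage; counts how often it is scanned."""
    def __init__(self, records):
        self.records = records
        self.reads = 0

    def read(self):
        self.reads += 1  # each call stands in for a full disk/HDFS scan
        return list(self.records)

def run_uncached(src, passes):
    # MapReduce-style: every pass re-reads the data from storage.
    return [sum(src.read()) for _ in range(passes)]

def run_cached(src, passes):
    # Spark-style: read once, keep the dataset in memory, reuse it.
    cached = src.read()
    return [sum(cached) for _ in range(passes)]

src1, src2 = Source(range(5)), Source(range(5))
run_uncached(src1, 3)
run_cached(src2, 3)
print(src1.reads, src2.reads)  # 3 1
```

Three passes over the uncached source cost three scans; the cached version scans once. In Spark the same idea is expressed by calling `cache()` or `persist()` on an RDD before reusing it across multiple actions.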

suresiva answered Oct 05 '22