
Hadoop: HDFS File Writes & Reads

Tags:

hadoop

hdfs

I have a basic question regarding file writes and reads in HDFS.

For example, if I am writing a file using the default configuration, Hadoop internally has to write each block to 3 DataNodes. My understanding is that for each block, the client first writes the block to the first DataNode in the pipeline, which then forwards it to the second, and so on. Once the third DataNode successfully receives the block, it sends an acknowledgement back to DataNode 2 and finally to the client through DataNode 1. Only after the acknowledgement for the block is received is the write considered successful, and the client proceeds to write the next block.

If this is the case, then isn't the time taken to write each block more than in a traditional file write, due to:

  1. the replication factor (default is 3), and
  2. the write process happening sequentially, block after block?

Please correct me if my understanding is wrong. Also, please confirm the following:

  1. My understanding is that a file read/write in Hadoop has no parallelism, and the best it can perform is the same as a traditional file read or write (i.e. with replication set to 1), plus some overhead from the distributed communication mechanism.
  2. Parallelism is provided only during the data processing phase via MapReduce, but not during a file read/write by a client.
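For reference, here is a minimal sketch of a client-side write using the Hadoop FileSystem Java API (the path is illustrative). The pipeline and acknowledgements described above happen inside the output stream; the client only sees an ordinary stream write:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);       // HDFS client

        Path file = new Path("/tmp/example.txt");   // illustrative path
        // Replication and block size come from dfs.replication / dfs.blocksize.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");             // the client just writes to a stream;
                                                    // the 3-DataNode pipeline is internal
        }
    }
}
```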
Vijay Bhoomireddy asked Jun 19 '14

People also ask

How does HDFS read & write files?

HDFS follows a Write Once Read Many model. So we can't edit files that are already stored in HDFS, but we can add to them by reopening the file for append. This design allows HDFS to scale to a large number of concurrent clients because the data traffic is spread across all the DataNodes in the cluster.
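As a sketch of that "reopen the file" behaviour, the FileSystem API exposes an append call (on some releases append has to be enabled via dfs.support.append; the path below is illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/example.txt");    // illustrative path

        // Existing bytes cannot be modified, but new bytes can be added
        // by reopening the file for append.
        try (FSDataOutputStream out = fs.append(file)) {
            out.writeBytes("appended line\n");
        }
    }
}
```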

Is HDFS Write Once Read Many?

HDFS supports write-once-read-many semantics on files. A typical block size used by HDFS is 128 MB. Thus, an HDFS file is chopped up into 128 MB chunks, and if possible, each chunk will reside on a different DataNode.
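To see how a file is chopped into blocks and which DataNodes hold the replicas, the FileSystem API can report block size, replication and block locations; a small sketch with an illustrative path:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/tmp/example.txt")); // illustrative path

        System.out.println("block size:  " + status.getBlockSize());   // e.g. 128 MB
        System.out.println("replication: " + status.getReplication()); // e.g. 3

        // One entry per block; each entry lists the DataNodes holding a replica.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println(loc);
        }
    }
}
```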

How does a client read and write data in HDFS?

The client interacts with the HDFS DataNodes directly: it presents the security tokens provided by the NameNode to the DataNodes and starts reading data from them. The data flows directly from the DataNodes to the client.
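A minimal read looks like the sketch below: open() goes through the NameNode for metadata, and the bytes are then streamed straight from the DataNodes (the path is illustrative):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // open() asks the NameNode for block locations; the bytes themselves
        // are streamed directly from the DataNodes.
        try (FSDataInputStream in = fs.open(new Path("/tmp/example.txt"));  // illustrative path
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```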

Can we edit a file once written in HDFS?

You cannot modify data once it is stored in HDFS, because HDFS follows the Write Once Read Many model.


1 Answer

Though your explanation of a file write is correct, a DataNode can read and write data simultaneously. From the HDFS Architecture Guide:

a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline

A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead) but not as much as 3x (assuming a replication factor of 3).
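If you want to see that cost for yourself, a hedged sketch: write the same payload once with replication 1 and once with replication 3 and compare the times. The create overload taking buffer size, replication and block size is part of the public FileSystem API; the paths, payload size and buffer size below are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationTimingSketch {
    static long timedWrite(FileSystem fs, Path path, short replication, byte[] payload) throws Exception {
        long start = System.nanoTime();
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out =
                 fs.create(path, true, 4096, replication, 128L * 1024 * 1024)) {
            out.write(payload);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] payload = new byte[64 * 1024 * 1024];   // 64 MB of zeros, illustrative

        long r1 = timedWrite(fs, new Path("/tmp/rep1.bin"), (short) 1, payload);
        long r3 = timedWrite(fs, new Path("/tmp/rep3.bin"), (short) 3, payload);

        // Because each DataNode forwards data while still receiving it,
        // the replication-3 write should be slower than replication-1, but well under 3x.
        System.out.printf("replication 1: %d ms, replication 3: %d ms%n",
                r1 / 1_000_000, r3 / 1_000_000);
    }
}
```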

Santiago Cepas answered Oct 15 '22