Hadoop: HDFS File Writes & Reads

Tags:

hdfs

I have a basic question regarding file writes and reads in HDFS.

For example, if I am writing a file, using the default configurations, Hadoop internally has to write each block to 3 data nodes. My understanding is that for each block, first the client writes the block to the first data node in the pipeline which will then inform the second and so on. Once the third data node successfully receives the block, it provides an acknowledgement back to data node 2 and finally to the client through Data node 1. Only after receiving the acknowledgement for the block, the write is considered successful and the client proceeds to write the next block.

If this is the case, then isn't the time taken to write each block is more than a traditional file write due to -

the replication factor (default is 3) and
the write process is happening sequentially block after block.

Please correct me if I am wrong in my understanding. Also, the following questions below:

My understanding is that File read / write in Hadoop doesn't have any parallelism and the best it can perform is same to a traditional file read or write (i.e. if the replication is set to 1) + some overhead involved in the distributed communication mechanism.
Parallelism is provided only during the data processing phase via Map Reduce, but not during file read / write by a client.

578

asked Jun 19 '14 07:06

Vijay Bhoomireddy

1 Answers

Though your above explanation of a file write is correct, a DataNode can read and write data simultaneously. From HDFS Architecture Guide:

a DataNode can be receiving data from the previous one in the pipeline and at the same time forwarding data to the next one in the pipeline

A write operation takes more time than on a traditional file system (due to bandwidth issues and general overhead) but not as much as 3x (assuming a replication factor of 3).

163

answered Oct 15 '22 16:10

Santiago Cepas

Related questions
                            
                                Compute Statistical mode in Hive
                            
                                Spark give Null pointer exception during InputSplit for Hbase
                            
                                how to pass variables in hive using hue
                            
                                Java or C++ API for Apache Drill
                            
                                Not able to fetch result from hive transaction enabled table through spark-sql
                            
                                Limit YARN containers programmatically
                            
                                How to make HDFS work in docker swarm
                            
                                Map Reduce Frameworks/Infrastructure
                            
                                0.20.2 API hadoop version with java 5
                            
                                Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB
                            
                                hadoop multiple already being created exception
                            
                                Using s3distcp with Amazon EMR to copy a single file
                            
                                Hive performance
                            
                                Hadoop ClassNotFoundException related to MapClass
                            
                                File jobtracker.info could only be replicated to 0 nodes, instead of 1
                            
                                Overriding RecordReader to read Paragraph at once instead of line
                            
                                Hadoop profile output - where and what?
                            
                                Getting "No space left on device" for approx. 10 GB of data on EMR m1.large instances
                            
                                How to run a Spark-java program from command line [closed]
                            
                                Apache Spark Throws java.lang.IllegalStateException: unread block data

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With