 

HDFS replication factor

Tags:

hadoop

hdfs

When I upload a file to HDFS and set the replication factor to 1, will the file splits reside on one single machine, or will they be distributed to multiple machines across the network?

hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
asked Oct 03 '11 by ablimit

People also ask

What is replication factor in HDFS and how can we set it?

The dfs.replication property in hdfs-site.xml changes the default replication factor for all files placed in HDFS. You can also change the replication factor on a per-file basis using the Hadoop FS shell, or change it for all the files under a directory.
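
For example, the setrep subcommand changes the replication factor of files already in HDFS (the paths here are illustrative; -w waits until re-replication finishes, -R recurses into a directory):

hadoop fs -setrep -w 2 /user/ablimit/file.txt
hadoop fs -setrep -R 2 /user/ablimit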


How does HDFS detect replication factor?

Try the command hadoop fs -stat %r /path/to/file; it prints the replication factor. Alternatively, in the output of hadoop fs -ls, the second column shows the replication factor for each file (for directories it shows -).
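
Hypothetical hadoop fs -ls output illustrating that column (the file names and sizes are made up):

hadoop fs -ls /user/ablimit
-rw-r--r--   1 ablimit supergroup       1366 2011-10-03 12:00 /user/ablimit/file.txt
drwxr-xr-x   - ablimit supergroup          0 2011-10-03 12:00 /user/ablimit/data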

What is the common replication factor size in HDFS?

HDFS provides fault tolerance by replicating data blocks and distributing them among different DataNodes across the cluster. By default the replication factor is set to 3, and it is configurable.
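
The cluster-wide default can be overridden in hdfs-site.xml. A minimal sketch of the relevant property (the value shown is the stock default):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>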


2 Answers

According to Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.

This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.

I think it depends on whether the client is itself a Hadoop node. If the client runs on one of the cluster's nodes, then all the splits will be written to that same node, which gives no improvement in read/write throughput despite having multiple nodes in the cluster. If the client is outside the cluster, a node is chosen at random for each split, so the splits are spread across the nodes of the cluster, and this does give better read/write throughput.

One advantage of writing to multiple nodes is that even if one of the nodes goes down, only the couple of splits on that node are lost, and at least the data in the remaining splits can still be recovered.
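
To see where the splits of a file actually ended up, you can ask the namenode with fsck (the path is illustrative); it lists each block together with the datanodes holding its replicas:

hadoop fsck /user/ablimit/file.txt -files -blocks -locations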

answered Sep 22 '22 by Praveen Sripati


If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
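
You can confirm this after the upload; with replication set to 1, hadoop fs -stat %r should print 1 for the file (path as in the question):

hadoop fs -stat %r /user/ablimit/file.txt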

answered Sep 20 '22 by Hari Menon