
hadoop replication factor confusion

Tags:

hadoop

We have 3 settings for hadoop replication namely:

dfs.replication.max = 10
dfs.replication.min = 1
dfs.replication     = 2

So dfs.replication is the default replication factor for files in the Hadoop cluster unless a client sets it explicitly (for example with "setrep"), and a client can set the replication factor up to dfs.replication.max.

dfs.replication.min is used in two cases:

  1. During safe mode, the NameNode checks whether each block has at least dfs.replication.min replicas.
  2. The first dfs.replication.min replicas of a block are written synchronously; the remaining dfs.replication - dfs.replication.min replicas are written asynchronously.

So do we have to set these configuration values on every node (namenode + datanodes), or only on the client node?

And what if the values of these three settings differ across datanodes?

asked Dec 15 '22 by user2950086

1 Answer

The replication factor can’t be set for a specific node in the cluster; you can set it for the entire cluster, a directory, or a file. dfs.replication can be updated on a running cluster in hdfs-site.xml.
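
To check which values a given node or client actually resolves from its local configuration, one quick option (a sketch, assuming the standard hdfs CLI is available on that node) is hdfs getconf:

hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.replication.max
hdfs getconf -confKey dfs.namenode.replication.min

Each command prints the value the node it is run on picks up from its hdfs-site.xml (or the built-in default), which also makes it easy to spot nodes whose settings have drifted. (dfs.namenode.replication.min is the newer name for dfs.replication.min.)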

Set the replication factor for a file: hadoop dfs -setrep -w <rep-number> file-path

Or set it recursively for a directory or for the entire cluster: hadoop fs -setrep -R -w 1 /
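
For example (the paths below are just illustrations; -w makes the command wait until the target replication is actually reached, and hadoop fs -stat with the %r format prints a file's current replication factor):

hadoop fs -setrep -w 3 /data/events/part-00000
hadoop fs -setrep -R -w 2 /data/events
hadoop fs -stat %r /data/events/part-00000

hadoop fs -ls also shows the replication factor for each file in its second column.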

Uses of the min and max replication factors:

  1. While data is being written to datanodes, some of them may fail. The write operation succeeds as long as at least dfs.namenode.replication.min replicas are written. After the write, the blocks are replicated asynchronously until they reach the dfs.replication level.

  2. The max replication factor dfs.replication.max sets an upper limit on block replication. A user can’t request a replication factor higher than this limit when creating a file.

  3. You can set a higher replication factor on the blocks of a popular file to spread read load across the cluster.
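
As a sketch of point 3 (the path is hypothetical; the requested value must still be at or below dfs.replication.max):

hadoop fs -setrep -w 10 /shared/lookup/country_codes.csv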

answered Jan 12 '23 by Rahul Sharma