We have three settings for Hadoop replication, namely:
dfs.replication.max = 10
dfs.replication.min = 1
dfs.replication = 2
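For reference, here is a minimal sketch of how these three settings would look in hdfs-site.xml, using the values above (note that dfs.replication.min was renamed dfs.namenode.replication.min in Hadoop 2.x):

    <configuration>
      <property>
        <name>dfs.replication</name>      <!-- default replication factor for new files -->
        <value>2</value>
      </property>
      <property>
        <name>dfs.replication.max</name>  <!-- upper bound a client may request -->
        <value>10</value>
      </property>
      <property>
        <name>dfs.replication.min</name>  <!-- minimum replicas for a write to succeed -->
        <value>1</value>
      </property>
    </configuration>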
So dfs.replication is the default replication factor for files in the Hadoop cluster until a Hadoop client sets it manually using "setrep", and a Hadoop client can set the replication factor up to dfs.replication.max.
And dfs.replication.min is used in two cases.
So do we have to set these configurations on each node (namenode + datanode), or only on the client node?
What if the values of these three settings vary across different datanodes?
The replication factor can’t be set for any specific node in the cluster; you can set it for the entire cluster, a directory, or a file. dfs.replication can be updated in a running cluster in hdfs-site.xml.
Set the replication factor for a file: hadoop fs -setrep -w <rep-number> <file-path>
Or set it recursively for a directory or for the entire cluster: hadoop fs -setrep -R -w 1 /
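To confirm a change took effect, you can check a file’s current replication factor with standard HDFS tooling (the path below is a placeholder):

    # print the replication factor of a single file
    hadoop fs -stat %r /path/to/file
    # fsck lists per-block replication and flags under-replicated blocks
    hdfs fsck /path/to/file -files -blocks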
Use of the min and max replication factors-
While writing data to datanodes, it is possible that some datanodes fail. As long as dfs.namenode.replication.min replicas are written, the write operation succeeds. After the write operation, the blocks are replicated asynchronously until they reach the dfs.replication level.
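You can watch this asynchronous catch-up with fsck; its summary includes an under-replicated block count that falls back to zero as the namenode schedules re-replication (the path is a placeholder):

    # run right after a write that succeeded with fewer than dfs.replication
    # replicas; the summary reports those blocks as under-replicated
    hdfs fsck /data -files -blocks
    # re-run later: the under-replicated count drops to 0 once the namenode
    # has finished the asynchronous re-replication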
The max replication factor, dfs.replication.max, sets the replication limit for blocks. A user can’t set a block replication higher than this limit while creating a file.
You can set a high replication factor for the blocks of a popular file to distribute the read load across the cluster.
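For example (the path is hypothetical), raising a hot file to the configured maximum of 10 spreads reads across more datanodes, while a request above dfs.replication.max is rejected by the namenode:

    # -w waits until the target replication is actually reached
    hadoop fs -setrep -w 10 /shared/hot/lookup-table.csv
    # this exceeds dfs.replication.max = 10, so the namenode refuses it
    hadoop fs -setrep 15 /shared/hot/lookup-table.csv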