I've reduced the replication factor from 3 to 1, yet I do not see any activity from the namenode or between datanodes to remove over-replicated HDFS file blocks. Is there a way to monitor or force the replication job?
The Hadoop file system shell provides the setrep command. It sets the replication factor of a specific file, or of every file under a directory, to a given count, independent of the default replication factor that applies to the rest of HDFS.
The replication factor is the number of copies of each block that must exist in the cluster. The default is 3 (one original block plus 2 replicas), so every file created in HDFS gets a replication factor of 3 unless configured otherwise. You can check the configured value in the hdfs-site configuration file.
The default of 3 is configurable.
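As a minimal sketch of how to check these values from the command line, assuming the hdfs client is installed and configured on the node: the function names below are illustrative, and the path in the usage note is hypothetical.

```shell
#!/bin/sh
# Print the default replication factor as seen by this client's configuration.
default_replication() {
    hdfs getconf -confKey dfs.replication
}

# Print the replication factor recorded for a single file.
# %r is the replication format specifier of `hdfs dfs -stat`.
file_replication() {
    hdfs dfs -stat %r "$1"
}
```

For example, `default_replication` prints the configured default, and `file_replication /user/example/data.csv` prints the factor stored for that one file. The two can differ, because the default only applies when a file is created.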
Results show that increasing the replication factor of "hot" data increases the availability and locality of the data and thus decreases job execution time.
Changing dfs.replication only applies to new files you create; it does not modify the replication factor of files that already exist.
To change the replication factor of files that already exist, run the following command, which applies recursively to all files in HDFS (hdfs dfs replaces the deprecated hadoop dfs, and the flags go before the replica count):
hdfs dfs -setrep -R -w 1 /
When you change the default replication factor from 3 to, say, 2 in Cloudera Manager:
Cloudera Manager (CDH 5.0.2) -> HDFS -> Configuration -> View and Edit -> Service-Wide -> Replication -> Replication Factor (dfs.replication) -> 2
only newly written data will have 2 replicas for each block.
To change the replication factor of all existing data, run
hdfs dfs -setrep 2 /
on the command line (generally on a node with the HDFS Gateway role). This command recursively changes the replication factor of all files under the root directory /.
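Running setrep on / touches every file, so it can be useful to limit it to a subtree. The following is a small sketch of a wrapper, assuming the hdfs client is on the PATH; set_replication is a hypothetical helper name, not part of Hadoop.

```shell
#!/bin/sh
# Set the replication factor for everything under a given path.
# Rejects an empty, non-numeric, or zero replica count before calling hdfs.
set_replication() {
    replicas=$1
    target=$2
    case $replicas in
        ''|*[!0-9]*|0)
            echo "replica count must be a positive integer" >&2
            return 1
            ;;
    esac
    if [ -z "$target" ]; then
        echo "usage: set_replication <numReplicas> <path>" >&2
        return 1
    fi
    # -w waits until the target replication is reached; this can take a long time.
    hdfs dfs -setrep -w "$replicas" "$target"
}
```

For example, `set_replication 2 /user/example` (a hypothetical directory) changes only the files under that directory and leaves the rest of HDFS untouched.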
Syntax:
hdfs dfs -setrep [-R] [-w] <numReplicas> <path>
where
-w flag requests that the command wait for the replication to complete and can take a very long time
-R flag is just for backwards compatibility and has no effect
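To monitor whether excess replicas are actually being removed, one option is to look at the summary printed by hdfs fsck. The sketch below filters that summary for the replication lines. It assumes the summary contains labels such as "Over-replicated blocks" and "Average block replication"; the exact wording may vary between Hadoop versions, and replication_summary is an illustrative name.

```shell
#!/bin/sh
# Show only the replication-related lines of the fsck summary for a path.
# Defaults to the root directory when no path is given.
replication_summary() {
    hdfs fsck "${1:-/}" | grep -E 'replicated blocks|Average block replication'
}
```

Running `replication_summary /` repeatedly after a setrep should show the over-replicated count dropping as the namenode schedules the extra replicas for deletion.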
Reference:
http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.2.0-cdh5.0.0-beta-2/hadoop-project-dist/hadoop-common/FileSystemShell.html#setrep