
HDFS Reduced Replication Factor

I've reduced the replication factor from 3 to 1, yet do not see any activity from the namenode or between datanodes to remove overly-replicated HDFS file blocks. Is there a way to monitor or force the replication job?

asked Jul 23 '13 by Carl Sagan


People also ask

How does Hadoop reduce replication factor?

You can use the setrep command in the Hadoop file system shell. This command changes the replication factor of a file to a specific count, overriding the default replication factor the file was written with.

What is HDFS replication factor controlled?

The replication factor is the number of copies of each block that must exist in the cluster. By default this value is 3 (the original block plus 2 replicas), so every file created in HDFS has a replication factor of 3 unless configured otherwise. You can check the configured value in hdfs-site.xml.

What is the default replication factor in HDFS?

By default, this replication factor is set to 3 which is configurable.

What are the benefits of increasing the replication factor of files in HDFS?

Results show that increasing the replication factor of "hot" data increases the availability and locality of the data, and thus decreases job execution time.


2 Answers

Changing dfs.replication will only apply to new files you create, but will not modify the replication factor for the already existing files.
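As a quick sanity check before changing anything, you can confirm what default the client is actually configured with. A minimal sketch, assuming a Hadoop client installation with `hdfs` on the PATH and a readable configuration:

```shell
# Print the value of dfs.replication as seen by this client's
# configuration (hdfs-site.xml / core-site.xml on the classpath).
# This is the default applied to newly created files only.
hdfs getconf -confKey dfs.replication
```

This requires a running cluster or at least a configured client, so the output depends entirely on your environment.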

To change the replication factor of files that already exist, you can run the following command, which applies recursively to all files in HDFS:

hadoop dfs -setrep -w 1 -R /
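To answer the monitoring part of the question: fsck reports over-replicated block counts, so you can watch the excess replicas drain after the change. A sketch, assuming a running cluster and an HDFS client on the PATH (the file path is illustrative):

```shell
# Summarize block health for the whole filesystem; the report includes
# an "Over-replicated blocks" line that should drop to 0 once the
# namenode finishes scheduling deletion of the excess replicas.
hdfs fsck /

# Check the replication factor currently recorded for a single file
# (%r prints the replication count).
hadoop fs -stat %r /path/to/file
```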
answered Oct 05 '22 by Charles Menguy


When you change the default replication factor from 3 to, say, 2 in Cloudera Manager

Cloudera Manager (CDH 5.0.2) -> HDFS -> Configuration -> View and Edit -> Service-Wide -> Replication -> Replication Factor (dfs.replication) -> 2

then only newly written data will have 2 replicas per block.

Please use

hdfs dfs -setrep 2 /

on the command line (generally on a node with an HDFS Gateway role) if you want to change the replication factor of all existing data. This command recursively changes the replication factor of every file under the root directory /.

Syntax:

hdfs dfs -setrep [-R] [-w] <numReplicas> <path>

where

the -w flag requests that the command wait for the replication to complete, which can take a very long time

the -R flag is accepted only for backwards compatibility and has no effect
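As a concrete example of the syntax above (the target path is illustrative; assumes a running cluster):

```shell
# Reduce the replication factor of everything under /user/data to 2
# and block until the namenode confirms the change has completed.
hdfs dfs -setrep -w 2 /user/data
```

Dropping -w makes the command return immediately, leaving the namenode to delete the excess replicas in the background.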

Reference:

http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.2.0-cdh5.0.0-beta-2/hadoop-project-dist/hadoop-common/FileSystemShell.html#setrep

answered Oct 05 '22 by Ankit Rakha