 

HDFS replication factor

Tags:

hadoop

hdfs

When I upload a file to HDFS and set the replication factor to 1, will the file splits reside on one single machine, or will they be distributed to multiple machines across the network?

hadoop fs -D dfs.replication=1 -copyFromLocal file.txt /user/ablimit
asked Oct 03 '11 by ablimit

People also ask

What is replication factor in HDFS and how can we set it?

The dfs.replication property in hdfs-site.xml changes the default replication factor for all files placed in HDFS. You can also change the replication factor on a per-file basis using the Hadoop FS shell, or change it for all the files under a directory.
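
For example, the setrep subcommand changes the replication factor of files already in HDFS (the paths here are illustrative; -w waits until re-replication finishes, -R recurses into a directory):

hadoop fs -setrep -w 2 /user/ablimit/file.txt
hadoop fs -setrep -R 2 /user/ablimit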


How does HDFS detect replication factor?

Try the command hadoop fs -stat %r /path/to/file; it prints the replication factor. Alternatively, in the output of hadoop fs -ls, the second column shows the replication factor for each file (for directories it shows -).
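
Hypothetical hadoop fs -ls output illustrating that column (the file names and sizes are made up):

hadoop fs -ls /user/ablimit
-rw-r--r--   1 ablimit supergroup       1366 2011-10-03 12:00 /user/ablimit/file.txt
drwxr-xr-x   - ablimit supergroup          0 2011-10-03 12:00 /user/ablimit/data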

What is the common replication factor size in HDFS?

HDFS provides fault tolerance by replicating data blocks and distributing them among different DataNodes across the cluster. By default the replication factor is set to 3, and it is configurable.
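
The cluster-wide default can be overridden in hdfs-site.xml. A minimal sketch of the relevant property (the value shown is the stock default):

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>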


2 Answers

According to Hadoop: The Definitive Guide:

Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.

This logic makes sense as it decreases the network chatter between the different nodes. But the book was published in 2009, and there have been a lot of changes in the Hadoop framework since then.

I think it depends on whether the client is itself a Hadoop node. If the client runs on one of the cluster's nodes, then all the splits will be written to that same node, which gives no improvement in read/write throughput despite having multiple nodes in the cluster. If the client is outside the cluster, a node is chosen at random for each split, so the splits are spread across the nodes of the cluster, and this does give better read/write throughput.

One advantage of writing to multiple nodes is that even if one of the nodes goes down, only the couple of splits on that node are lost, and at least the data in the remaining splits can still be recovered.
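
To see where the splits of a file actually ended up, you can ask the namenode with fsck (the path is illustrative); it lists each block together with the datanodes holding its replicas:

hadoop fsck /user/ablimit/file.txt -files -blocks -locations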

answered Sep 22 '22 by Praveen Sripati


If you set replication to 1, then the file will be present only on the client node, that is, the node from which you are uploading the file.
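
You can confirm this after the upload; with replication set to 1, hadoop fs -stat %r should print 1 for the file (path as in the question):

hadoop fs -stat %r /user/ablimit/file.txt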

answered Sep 20 '22 by Hari Menon