 

hdfs moveFromLocal does not distribute replica blocks across data nodes

I recently upgraded my Cloudera environment from 5.8.x (Hadoop 2.6.0, HDFS-1) to 6.3.x (Hadoop 3.0.0, HDFS-1), and after some days of data loads with moveFromLocal, I realized that the DFS Used% of the datanode server on which I execute moveFromLocal is 3x higher than that of the others.

Then, having run fsck with the -blocks, -locations and -replicaDetails flags over the HDFS path to which I load the data, I observed that the replica blocks (RF=2) all sit on that same server and are not distributed to other nodes unless I manually run hdfs balancer.
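For reference, the commands in question look roughly like this (the HDFS path is a placeholder and the balancer threshold is just an example value):

    # Inspect block placement for the loaded path
    hdfs fsck /path/to/data -files -blocks -locations -replicaDetails

    # Manually rebalance; threshold = allowed % deviation of DFS Used from the cluster average
    hdfs balancer -threshold 10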

There is a related question asked a month ago, hdfs put/moveFromLocal not distributing data across data nodes?, which does not really answer this. The files I keep loading are Parquet files.

There was no such problem in Cloudera 5.8.x. Is there some new configuration I should make in Cloudera 6.3.x related to replication, rack awareness, or something like that?
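As a starting point for that check, the effective replication and rack-awareness settings can be read back from the running cluster; the property names below are the standard HDFS ones, and this is only a diagnostic sketch, not a known fix:

    # Effective default replication factor and rack-awareness script
    hdfs getconf -confKey dfs.replication
    hdfs getconf -confKey net.topology.script.file.name

    # Rack assignment of each datanode as seen by the NameNode
    hdfs dfsadmin -printTopology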

Any help would be highly appreciated.

asked Jan 07 '20 by belce

People also ask

How does HDFS handle data replication?

Data Replication. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.
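For example, the replication factor of existing data can be inspected and changed from the command line (the path is a placeholder):

    # The second column of the listing is the per-file replication factor
    hdfs dfs -ls /path/to/dir

    # Set the replication factor to 2 for everything under the path and wait until re-replication finishes
    hdfs dfs -setrep -w 2 /path/to/dir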

How data is distributed in HDFS?

HDFS also has multiple DataNodes on a commodity hardware cluster -- typically one per node in a cluster. The DataNodes are generally organized within the same rack in the data center. Data is broken down into separate blocks and distributed among the various DataNodes for storage.

What is DataNode in Hadoop?

DataNodes are the slave nodes in HDFS. The actual data is stored on DataNodes. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode; spinning until that service comes up.

How large files are stored in HDFS?

By default, these blocks are 64 MB in size in Hadoop 1 and 128 MB in Hadoop 2, which means that all the blocks obtained after dividing a file should be 64 MB or 128 MB in size (except possibly the last one). You can manually change the block size in hdfs-site.xml.
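A quick way to check the configured block size, or to override it for a single upload (the 256 MB value and paths are only examples):

    # Default block size configured on the cluster, in bytes
    hdfs getconf -confKey dfs.blocksize

    # Write one file with a non-default block size of 256 MB
    hdfs dfs -D dfs.blocksize=268435456 -put localfile /path/in/hdfs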


2 Answers

According to the HDFS Architecture doc, "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode..."

Per the same doc, "Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time."

answered Oct 16 '22 by mazaneicha


You are probably doing moveFromLocal on one of your datanodes. It seems you need to run moveFromLocal from a non-datanode (for example, an edge node) to get an even distribution across your cluster.
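A minimal sketch of that workflow, assuming an edge node that has the HDFS client configured but does not run a DataNode (paths are placeholders):

    # Load from a host that is not a datanode, so the writer has no local replica target
    hdfs dfs -moveFromLocal /local/staging/*.parquet /data/target/

    # Verify that the replicas are now spread across different datanodes
    hdfs fsck /data/target/ -blocks -locations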

answered Oct 16 '22 by hll