I recently upgraded my Cloudera environment from 5.8.x (Hadoop 2.6.0, hdfs-1) to 6.3.x (Hadoop 3.0.0, hdfs-1), and after some days of data loads with moveFromLocal, I realized that the DFS Used% of the datanode server on which I execute moveFromLocal is 3x that of the others.
Then, having run fsck with the -blocks, -locations and -replicaDetails flags over the HDFS path to which I load the data, I observed that the replicated blocks (RF=2) are all on that same server and are not distributed to other nodes unless I manually run hdfs balancer.
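For reference, a sketch of the commands described above; /data/landing stands in for the actual HDFS path and is a hypothetical name here:

    # Show each file's blocks, every replica, and the datanode each replica lives on
    hdfs fsck /data/landing -files -blocks -locations -replicaDetails

    # Spread blocks across datanodes; threshold is the allowed deviation
    # from average cluster utilization, in percent
    hdfs balancer -threshold 10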
There is a pertinent question asked a month ago, "hdfs put/moveFromLocal not distributing data across data nodes?", which does not really answer this; the files I keep loading are Parquet files.
There was no such problem in Cloudera 5.8.x. Is there some new configuration I should make in Cloudera 6.3.x related to replication, rack awareness, or something like that?
Any help would be highly appreciated.
Data Replication. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.
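For instance, the effective replication factor can be checked and changed from the shell; /data/landing is again a hypothetical path:

    # Cluster-wide default replication factor (dfs.replication)
    hdfs getconf -confKey dfs.replication

    # Change the replication factor of files already in HDFS
    # (-w blocks until the target replication is actually reached)
    hdfs dfs -setrep -w 2 /data/landing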
HDFS also has multiple DataNodes on a commodity hardware cluster, typically one per node. The DataNodes are generally organized within the same rack in the data center. Data is broken down into separate blocks and distributed among the various DataNodes for storage.
DataNodes are the slave nodes in HDFS. The actual data is stored on DataNodes. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up.
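To see which DataNodes have registered with the NameNode, and how full each one is (useful for spotting the 3x DFS Used% skew from the question), the standard admin report can be used:

    # Lists live and dead DataNodes with Capacity, DFS Used and DFS Used% per node
    hdfs dfsadmin -report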
By default, these blocks are 64 MB in Hadoop 1 and 128 MB in Hadoop 2, which means each block obtained after dividing a file is 64 MB or 128 MB in size, except possibly the last one. You can manually change the block size via the dfs.blocksize property in hdfs-site.xml.
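For example, the default can be inspected, and overridden per command with the generic -D option; the file name and target path below are hypothetical:

    # Current default block size in bytes (dfs.blocksize)
    hdfs getconf -confKey dfs.blocksize

    # Override the block size for a single load: 256 MB, expressed in bytes
    hdfs dfs -D dfs.blocksize=268435456 -moveFromLocal part-00000.parquet /data/landing/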
According to the HDFS Architecture doc, "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode..."
Per the same doc, "Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time."
You are probably doing moveFromLocal on one of your datanodes, so the first replica of every block is placed locally per the policy quoted above. It seems you need to run moveFromLocal from a non-datanode machine (for example, an edge/gateway node) to get even distribution across your cluster.
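A minimal sketch of that workaround, assuming an edge node with the HDFS client configured and a hypothetical local staging directory:

    # Run the load from an edge/gateway node that is NOT a DataNode;
    # the placement policy then chooses a random datanode for the first replica
    hdfs dfs -moveFromLocal /local/staging/*.parquet /data/landing/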