I recently upgraded my Cloudera environment from 5.8.x (Hadoop 2.6.0, hdfs-1) to 6.3.x (Hadoop 3.0.0, hdfs-1), and after some days of data loads with moveFromLocal, I realized that the DFS Used% of the datanode server on which I execute moveFromLocal is 3x that of the others.
Then, having run fsck with the -blocks, -locations and -replicaDetails flags over the HDFS path to which I load the data, I observed that the replicated blocks (RF=2) are all on that same server and are not distributed to other nodes unless I manually run hdfs balancer.
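For reference, a sketch of the commands described above; /data/landing stands in for the actual HDFS path and is a hypothetical name here:

    # Show each file's blocks, every replica, and the datanode each replica lives on
    hdfs fsck /data/landing -files -blocks -locations -replicaDetails

    # Spread blocks across datanodes; threshold is the allowed deviation
    # from average cluster utilization, in percent
    hdfs balancer -threshold 10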
There is a pertinent question asked a month ago, "hdfs put/moveFromLocal not distributing data across data nodes?", which does not really answer this; the files I keep loading are Parquet files.
There was no such problem in Cloudera 5.8.x. Is there some new configuration I should make in Cloudera 6.3.x related to replication, rack awareness, or something like that?
Any help would be highly appreciated.
Data Replication. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance.
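For instance, the effective replication factor can be checked and changed from the shell; /data/landing is again a hypothetical path:

    # Cluster-wide default replication factor (dfs.replication)
    hdfs getconf -confKey dfs.replication

    # Change the replication factor of files already in HDFS
    # (-w blocks until the target replication is actually reached)
    hdfs dfs -setrep -w 2 /data/landing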
HDFS also has multiple DataNodes on a commodity hardware cluster, typically one per node. The DataNodes are generally organized within the same rack in the data center. Data is broken down into separate blocks and distributed among the various DataNodes for storage.
DataNodes are the slave nodes in HDFS. The actual data is stored on DataNodes. A functional filesystem has more than one DataNode, with data replicated across them. On startup, a DataNode connects to the NameNode, spinning until that service comes up.
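To see which DataNodes have registered with the NameNode, and how full each one is (useful for spotting the 3x DFS Used% skew from the question), the standard admin report can be used:

    # Lists live and dead DataNodes with Capacity, DFS Used and DFS Used% per node
    hdfs dfsadmin -report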
By default, these blocks are 64 MB in Hadoop 1 and 128 MB in Hadoop 2, which means each block obtained after dividing a file is 64 MB or 128 MB in size, except possibly the last one. You can manually change the block size via the dfs.blocksize property in hdfs-site.xml.
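For example, the default can be inspected, and overridden per command with the generic -D option; the file name and target path below are hypothetical:

    # Current default block size in bytes (dfs.blocksize)
    hdfs getconf -confKey dfs.blocksize

    # Override the block size for a single load: 256 MB, expressed in bytes
    hdfs dfs -D dfs.blocksize=268435456 -moveFromLocal part-00000.parquet /data/landing/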
According to the HDFS Architecture doc, "For the common case, when the replication factor is three, HDFS’s placement policy is to put one replica on the local machine if the writer is on a datanode..."
Per the same doc, "Because the NameNode does not allow DataNodes to have multiple replicas of the same block, maximum number of replicas created is the total number of DataNodes at that time."
You are probably doing moveFromLocal on one of your datanodes, so the first replica of every block is placed locally per the policy quoted above. It seems you need to run moveFromLocal from a non-datanode machine (for example, an edge/gateway node) to get even distribution across your cluster.
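A minimal sketch of that workaround, assuming an edge node with the HDFS client configured and a hypothetical local staging directory:

    # Run the load from an edge/gateway node that is NOT a DataNode;
    # the placement policy then chooses a random datanode for the first replica
    hdfs dfs -moveFromLocal /local/staging/*.parquet /data/landing/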