Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /", the total is always smaller than what "hdfs dfsadmin -report" shows on its "DFS Used" line. Is there data that du does not take into account?
There IS a difference between the two commands. As Apache's official documentation illustrates, 'hdfs dfs' is used specifically for data operations on the Hadoop filesystem (HDFS), while 'hadoop fs' covers a larger variety of filesystems, including data present on external platforms.
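For example, both invocations below list the same HDFS directory, while 'hadoop fs' can also address other filesystems (the paths are placeholders, not from the original question):

hdfs dfs -ls /user/data          # talks to HDFS only
hadoop fs -ls /user/data         # same result when the default filesystem is HDFS
hadoop fs -ls file:///tmp        # can also address the local filesystem via a URI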
Use the hdfs dfs -du command to get the size of a directory in HDFS. Add the -x flag to exclude snapshots from the result.
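A minimal sketch (the path and figures are purely illustrative; on recent Hadoop releases du prints two columns, the raw size and the size including all replicas):

hdfs dfs -du -h -x /user/data
# 1.2 G   3.6 G   /user/data/logs    <- raw size, then size with all replicas (factor 3 here)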
hdfs dfsadmin -report outputs a brief report on the overall HDFS filesystem. It's a useful command to quickly see how much disk space is available, how many DataNodes are running, whether there are corrupt blocks, and so on. Note that the figures it reports are disk space as seen by HDFS itself.
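An abridged report looks like this (field names as printed by the command; the values are hypothetical):

hdfs dfsadmin -report
# Configured Capacity: 12000000000000 (10.91 TB)
# DFS Used: 9000000000000 (8.19 TB)
# DFS Remaining: ...
# ...
# Live datanodes (3):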
The Hadoop filesystem provides reliable storage by putting a copy of each block on several nodes. The number of copies is the replication factor; it is usually greater than one.
The command
hdfs dfs -du /
shows the space consumed by your data without replication.
The command
hdfs dfsadmin -report
(the "DFS Used" line) shows the actual disk usage, taking replication into account. It should therefore be several times bigger than the number you get from the dfs -du
command.
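A quick sanity check, assuming the default replication factor of 3 and ignoring snapshots and temporary block files (the 100 GB figure is hypothetical):

hdfs dfs -du -s /                # suppose this sums to 100 GB of raw data
# expected "DFS Used" in hdfs dfsadmin -report ≈ 100 GB × 3 = 300 GB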