 

What exactly does "Non DFS Used" mean?


This is what I saw on the HDFS web UI recently:

 Configured Capacity     :   232.5 GB
 DFS Used    :   112.44 GB
 Non DFS Used    :   119.46 GB
 DFS Remaining   :   613.88 MB
 DFS Used%   :   48.36 %
 DFS Remaining%  :   0.26 %

I'm confused that Non DFS Used takes up more than half of the capacity,

which I think means half of the Hadoop storage is being wasted.

After fruitless searching, I just formatted the namenode and started from scratch.

Then I copied one huge text file (about 19 GB) from local disk to HDFS, which succeeded.

Now the UI says:

Configured Capacity  :   232.5 GB
DFS Used     :   38.52 GB
Non DFS Used     :   45.35 GB
DFS Remaining    :   148.62 GB
DFS Used%    :   16.57 %
DFS Remaining%   :   63.92 %

Before copying, DFS Used and Non DFS Used were both 0.

Because DFS Used is approximately double the original text file's size, and I configured a replication factor of 2,

I guess that DFS Used is made up of the two copies of the original plus metadata.

But I still have no idea where Non DFS Used came from, or why it takes up so much capacity, even more than DFS Used.

What happened? Did I make a mistake?

asked Aug 28 '13 01:08 by Adrian Seungjin Lee



2 Answers

"Non DFS used" is calculated by following formula:

Non DFS Used = Configured Capacity - Remaining Space - DFS Used

It is still confusing, at least to me,

because Configured Capacity = Total Disk Space - Reserved Space.

So Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used

Let's take an example. Suppose I have a 100 GB disk, and I set the reserved space (dfs.datanode.du.reserved) to 30 GB.
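For reference, that reservation is set per DataNode in hdfs-site.xml. A fragment along these lines would reserve roughly 30 GB per volume (the value is an example matching the hypothetical numbers here; dfs.datanode.du.reserved is specified in bytes):

```xml
<!-- hdfs-site.xml on each DataNode: reserve space per volume for non-DFS use.
     Value is in bytes; 32212254720 bytes ≈ 30 GiB (example value only). -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>32212254720</value>
</property>
```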

On that disk, the system and other files use 40 GB, and DFS Used is 10 GB. If you run df -h, you will see 50 GB of available space on that disk volume.

In the HDFS web UI, it will show:

Non DFS Used = 100 GB (Total) - 30 GB (Reserved) - 10 GB (DFS Used) - 50 GB (Remaining) = 10 GB

So it actually means you initially reserved 30 GB for non-DFS usage and 70 GB for HDFS. However, non-DFS usage exceeds the 30 GB reservation and eats up 10 GB of space that should belong to HDFS!
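As a sanity check, the arithmetic above can be replayed in a short shell sketch (all numbers are the hypothetical ones from this example, not from a real cluster):

```shell
# Hypothetical example values, in GB
total=100        # raw disk size
reserved=30      # dfs.datanode.du.reserved
dfs_used=10      # blocks stored by HDFS
remaining=50     # free space reported by df -h

# Non DFS Used = Total - Reserved - DFS Used - Remaining
non_dfs=$(( total - reserved - dfs_used - remaining ))
echo "Non DFS Used = ${non_dfs} GB"   # prints "Non DFS Used = 10 GB"
```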

The term "Non DFS used" should really be renamed to something like "How much configured DFS capacity are occupied by non dfs use"

And one should stop trying to figure out, from inside Hadoop, why non-DFS usage is so high.

One useful command is lsof | grep deleted, which helps you identify open files that have already been deleted. Sometimes Hadoop processes (such as hive, yarn, mapred, and hdfs) hold references to already-deleted files, and those references keep occupying disk space.

Also, du -hsx * | sort -rh | head -10 helps list the ten largest folders under the current directory.
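Putting the two commands together, a sketch for inspecting a DataNode volume might look like this (DATA_VOL is a hypothetical path; point it at the filesystem that holds your dfs.data.dir):

```shell
# DATA_VOL is a hypothetical mount point; adjust for your cluster.
DATA_VOL=${DATA_VOL:-/data}

# Ten largest entries on the volume (-x stays on one filesystem)
du -hsx "$DATA_VOL"/* 2>/dev/null | sort -rh | head -10

# Deleted-but-still-open files whose space has not been released yet
if command -v lsof >/dev/null 2>&1; then
    lsof 2>/dev/null | grep -i deleted | head
fi
```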

answered Nov 29 '22 11:11 by Tim Fei


Non DFS Used is any data in the filesystem of the data node(s) that isn't in dfs.data.dir. This includes log files, MapReduce shuffle output, and local copies of data files (if you put them on a data node). Use du or a similar tool to see what's taking up the space in your filesystem.

answered Nov 29 '22 10:11 by highlycaffeinated