
Where Does the HDFS Account for Triple Replication in Usage Reports?

Tags: size, hadoop, hdfs

In the latest version of most Hadoop distributions, the HDFS usage reports seem to report on space without accounting for the replication factor, correct?

When one looks at the Namenode Web UI and/or runs the 'hadoop dfsadmin -report' command, one can see a report that looks something like this:

Configured Capacity: 247699161084 (230.69 GB)
Present Capacity: 233972113408 (217.9 GB)
DFS Remaining: 162082414592 (150.95 GB)
DFS Used: 71889698816 (66.95 GB)
DFS Used%: 30.73%
Under replicated blocks: 40
Blocks with corrupt replicas: 6
Missing blocks: 0

Based on the machine sizes of this cluster, it seems that this report does NOT account for triple replication, i.e. if I place a file on HDFS, I have to account for the triple replication myself.

For example, if I placed a 50GB file on the HDFS, would my HDFS be dangerously close to full (since it seems that file would be replicated 3 times, using up the 150GB that currently remain)?

depthfirstdesigner asked Dec 03 '22 22:12


2 Answers

Let us define clearly what each of these terms means.

  1. Configured Capacity: It is the total capacity available to HDFS for Storage. So if you have 4 nodes and each node has 50 GB capacity, the configured capacity will be 200 GB. Replication factor is irrelevant in case of configured capacity.

  2. DFS Used: This is the amount of raw storage space that HDFS has used, counting every replica. Divide DFS Used by your replication factor to get the actual size of your files without replication (this assumes all files use the same replication factor). So if your DFS Used is 60 GB and your replication factor is 3, the actual size of your files is 60/3 = 20 GB.

  3. DFS Remaining: This is the amount of raw storage space still available to HDFS. If you have 150 GB of storage space remaining, that means you can store up to 150/3 = 50 GB of files before HDFS runs out of space (assuming replication factor = 3).

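  4. Present Capacity: The amount of storage space actually available to HDFS for block data, i.e. DFS Used + DFS Remaining (in the report above, 71889698816 + 162082414592 = 233972113408). The difference (Configured Capacity - Present Capacity) is the space on the DataNode volumes taken up by non-HDFS data or held back as reserved space, which the NameNode Web UI lists as "Non DFS Used". It is not file system metadata; that lives on the NameNode.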
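
To make the arithmetic concrete, here is a minimal Python sketch that applies the definitions above to the figures from the report in the question. It assumes every file uses a replication factor of 3 and that "50GB" means 50 GiB; the helper names `logical_bytes` and `fits` are made up for illustration and are not part of any Hadoop API.

```python
# Figures copied from the dfsadmin report in the question (all in bytes).
CONFIGURED_CAPACITY = 247_699_161_084
PRESENT_CAPACITY = 233_972_113_408
DFS_REMAINING = 162_082_414_592
DFS_USED = 71_889_698_816

# Assumption: every file uses a replication factor of 3.
REPLICATION = 3

GIB = 1024 ** 3


def logical_bytes(raw_bytes: int, replication: int = REPLICATION) -> float:
    """Convert raw (post-replication) bytes into pre-replication bytes."""
    return raw_bytes / replication


def fits(file_bytes: int, remaining_raw: int = DFS_REMAINING,
         replication: int = REPLICATION) -> bool:
    """Return True if a file of this size fits once all replicas are written."""
    return file_bytes * replication <= remaining_raw


if __name__ == "__main__":
    # Space on the DataNode volumes that is not usable for HDFS blocks.
    non_dfs_used = CONFIGURED_CAPACITY - PRESENT_CAPACITY

    print(f"Data stored, pre-replication: {logical_bytes(DFS_USED) / GIB:.2f} GiB")
    print(f"Room for new data, pre-replication: {logical_bytes(DFS_REMAINING) / GIB:.2f} GiB")
    print(f"Non DFS used: {non_dfs_used / GIB:.2f} GiB")
    print(f"50 GiB file fits: {fits(50 * GIB)}")
```

Under these assumptions a 50 GiB file would fit, but only just: its three replicas would consume almost all of DFS Remaining, which matches the concern raised in the question.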

Hope this clears it up.

Chaos answered Dec 22 '22 10:12


The dfsadmin report does account for replication: its figures are raw bytes counted across all replicas. If you want the pre-replication used bytes, use:

hdfs dfs -du -s /
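
As a rough sketch of how the two views relate, the Python snippet below parses one summary line of `-du -s` output and derives the effective replication factor. It assumes the layout printed by recent Hadoop releases (file size, then disk space consumed across all replicas, then path), with older releases printing only size and path; check your own version's output before relying on it. The names `DuSummary` and `parse_du_summary` and the sample line are hypothetical.

```python
from typing import NamedTuple, Optional


class DuSummary(NamedTuple):
    logical_bytes: int          # pre-replication size
    raw_bytes: Optional[int]    # size across all replicas, if reported
    path: str


def parse_du_summary(line: str) -> DuSummary:
    """Parse one line of `hdfs dfs -du -s` output.

    Assumed layouts: "<size> <disk space consumed> <path>" (newer releases)
    or "<size> <path>" (older releases).
    """
    fields = line.split()
    if len(fields) >= 3 and fields[1].isdigit():
        return DuSummary(int(fields[0]), int(fields[1]), " ".join(fields[2:]))
    return DuSummary(int(fields[0]), None, " ".join(fields[1:]))


if __name__ == "__main__":
    # Hypothetical sample line, not real cluster output.
    summary = parse_du_summary("23963232938  71889698814  /")
    if summary.raw_bytes is not None and summary.logical_bytes > 0:
        factor = summary.raw_bytes / summary.logical_bytes
        print(f"Effective replication factor for {summary.path}: {factor:.2f}")
```

If the raw figure here is close to the DFS Used value from the dfsadmin report, that is a quick sanity check that the report is counting replicas.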

cabad answered Dec 22 '22 12:12