
The way to check a HDFS directory's size?



Prior to 0.20.203, and officially deprecated in 2.6.0:

hadoop fs -dus [directory]

Since 0.20.203 and 1.0.4, and still compatible through 2.6.0:

hdfs dfs -du [-s] [-h] URI [URI …]

You can also run hadoop fs -help for more info and specifics.
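On reasonably recent releases you can also ask for help on just one subcommand, for example:

hadoop fs -help du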


hadoop fs -du -s -h /path/to/dir displays the directory's total size in human-readable form.
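The output looks roughly like this (size first, then path; newer releases add a middle column with the space consumed across all replicas; the figures here are only illustrative):

2.1 G  6.3 G  /path/to/dir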


Extending Matt D's and the other answers: up through Apache Hadoop 3.0.0, the command is

hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]

It displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.

Options:

  • The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, the calculation is done by going 1-level deep from the given path.
  • The -h option will format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
  • The -v option will display the names of columns as a header line.
  • The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.

du returns three columns with the following format:

 size    disk_space_consumed_with_all_replicas    full_path_name

Example command:

hadoop fs -du /user/hadoop/dir1 \
    /user/hadoop/file1 \
    hdfs://nn.example.com/user/hadoop/dir1 
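Each file argument, and each entry one level below a directory argument, gets a line in the three-column format above. With made-up byte counts and a replication factor of 3, the output could look something like:

1048576   3145728   /user/hadoop/dir1/part-00000
2097152   6291456   /user/hadoop/dir1/part-00001
1048576   3145728   /user/hadoop/file1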

Exit Code: Returns 0 on success and -1 on error.

source: Apache doc


This prints each entry's size in GB:

hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
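A variation on the same idea, if you only want a single aggregate figure for the directory (this assumes the first column is a plain byte count, i.e. -du -s without -h):

hdfs dfs -du -s PATHTODIRECTORY | awk '{ printf "%.2f GB\n", $1/(1024**3) }'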

When trying to calculate the total for a particular group of files within a directory, the -s option does not work (in Hadoop 2.7.1). For example:

Directory structure:

some_dir
├── abc.txt
├── count1.txt
├── count2.txt
└── def.txt

Assume each file is 1 KB in size. You can summarize the entire directory with:

hdfs dfs -du -s some_dir
4096 some_dir

However, if I want the sum of all files whose names contain "count", the command falls short:

hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt

To get around this I usually pass the output through awk.

hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048 
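The same pipeline can also report the total in a human-readable unit; for example, printing kilobytes for the files above:

hdfs dfs -du some_dir/count* | awk '{ total += $1 } END { printf "%.1f KB\n", total/1024 }'
2.0 KB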

To get the size of a directory, hdfs dfs -du -s -h /$yourDirectoryName can be used. For a quick cluster-level storage report, hdfs dfsadmin -report can be used.
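If you only want the headline capacity figures from that report, filtering it is straightforward (the field names below are what current releases print, but may vary slightly between versions):

hdfs dfsadmin -report | grep -E 'Configured Capacity|DFS Used|DFS Remaining'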


With Hadoop version 2.3.33:

hadoop fs -dus  /path/to/dir  |   awk '{print $2/1024**3 " G"}' 
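On newer releases where -dus is deprecated, a rough equivalent is du -s, which puts the byte count in the first column:

hdfs dfs -du -s /path/to/dir | awk '{print $1/1024**3 " G"}'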
