The default size of an HDFS data block is 128 MB. If blocks were small, there would be too many blocks in Hadoop HDFS and thus too much metadata to store. Managing such a huge number of blocks and their metadata would create overhead and lead to extra network traffic.
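If you want to verify this on a running cluster, the configured block size can be read with getconf, and fsck shows how a given file is split into blocks (the file path below is only an example):

# Print the configured default block size in bytes (134217728 bytes = 128 MB)
hdfs getconf -confKey dfs.blocksize

# Show how an example file is split into blocks
hdfs fsck /user/hadoop/file1 -files -blocks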
The following arguments are available with the hadoop ls command:
Usage: hadoop fs -ls [-d] [-h] [-R] [-t] [-S] [-r] [-u] <args>
Options:
- -d: Directories are listed as plain files.
- -h: Format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
- -R: Recursively list subdirectories encountered.
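For example, to list a directory recursively with human-readable file sizes (the path is only an example):

hadoop fs -ls -h -R /user/hadoop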
Prior to 0.20.203, and officially deprecated in 2.6.0:
hadoop fs -dus [directory]
Since 0.20.203 and 1.0.4, and still compatible through 2.6.0:
hdfs dfs -du [-s] [-h] URI [URI …]
You can also run hadoop fs -help for more info and specifics.
hadoop fs -du -s -h /path/to/dir
displays a directory's size in readable form.
Extending Matt D's and others' answers, up to Apache Hadoop 3.0.0 the command is
hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]
It displays sizes of files and directories contained in the given directory or the length of a file in case it's just a file.
Options:
- The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, the calculation is done by going 1-level deep from the given path.
- The -h option will format file sizes in a human-readable fashion (e.g. 64.0m instead of 67108864).
- The -v option will display the names of columns as a header line.
- The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.
du returns three columns with the following format:
+-------------------------------------------------------------------+
| size | disk_space_consumed_with_all_replicas | full_path_name |
+-------------------------------------------------------------------+
hadoop fs -du /user/hadoop/dir1 \
/user/hadoop/file1 \
hdfs://nn.example.com/user/hadoop/dir1
Exit Code: Returns 0 on success and -1 on error.
source: Apache doc
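As a quick illustration (the path is only an example), -s, -h and -v can be combined to print a single human-readable summary line preceded by the column header; adding -x would exclude snapshots from the totals:

hadoop fs -du -s -h -v /user/hadoop/dir1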
With this you will get the size in GB:
hdfs dfs -du PATHTODIRECTORY | awk '/^[0-9]+/ { print int($1/(1024**3)) " [GB]\t" $2 }'
When trying to calculate the total of a particular group of files within a directory, the -s option does not work (in Hadoop 2.7.1). For example:
Directory structure:
some_dir
├abc.txt
├count1.txt
├count2.txt
└def.txt
Assume each file is 1 KB in size. You can summarize the entire directory with:
hdfs dfs -du -s some_dir
4096 some_dir
However, if I want the sum of all files containing "count", the command falls short.
hdfs dfs -du -s some_dir/count*
1024 some_dir/count1.txt
1024 some_dir/count2.txt
To get around this I usually pass the output through awk.
hdfs dfs -du some_dir/count* | awk '{ total+=$1 } END { print total }'
2048
To get the size of a directory, hdfs dfs -du -s -h /$yourDirectoryName can be used. hdfs dfsadmin -report can be used to see a quick cluster-level storage report.
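For example (the directory name is a placeholder):

# Human-readable size of a single directory
hdfs dfs -du -s -h /user/hadoop/dir1

# Quick cluster-level storage report
hdfs dfsadmin -report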
hadoop version 2.3.33:
hadoop fs -dus /path/to/dir | awk '{print $2/1024**3 " G"}'