What is the maximum number of files and directories allowed in an HDFS (Hadoop) directory?
Files in HDFS are broken into block-sized chunks called data blocks. These blocks are stored as independent units. The size of these HDFS data blocks is 128 MB by default.
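A quick way to check what block size a cluster is actually configured with, or to override it for a single upload, is sketched below (the file and directory names are hypothetical, and the -D generic option is assumed to be accepted by your hdfs dfs shell):

# Print the configured default block size in bytes (134217728 = 128 MB)
hdfs getconf -confKey dfs.blocksize

# Hypothetical example: upload one file with a 256 MB block size instead of the default
hdfs dfs -D dfs.blocksize=268435456 -put ./bigfile.dat /data/bigfile.dat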
A file or directory has separate permissions for the user that owns it, for other users that are members of its group, and for all other users. For files, the r permission is required to read the file, and the w permission is required to write or append to it.
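For example (the paths are hypothetical), permissions can be inspected and changed with the usual shell commands:

# The first column of the listing shows the rwx permissions for owner, group, and others
hdfs dfs -ls /data

# Hypothetical example: owner gets read/write, group gets read-only, others get no access
hdfs dfs -chmod 640 /data/report.csv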
Use the hdfs dfs -du command to get the size of a directory in HDFS; the -x option excludes snapshots from the result. Snapshots are read-only, point-in-time copies of a folder structure in HDFS, usually used by Hadoop administrators to preserve the files and folders as they existed at a particular moment.
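A minimal sketch of that command (the path is hypothetical):

# Summarized (-s), human-readable (-h) size of a directory, excluding snapshot data (-x)
hdfs dfs -du -s -h -x /user/alice/data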
In modern Apache Hadoop versions, various HDFS limits are controlled by configuration properties with fs-limits in the name, all of which have reasonable default values. This question specifically asks about the number of children in a directory, which is governed by dfs.namenode.fs-limits.max-directory-items; its default value is 1048576.
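To see the value your own cluster is configured with, and how many direct children a given directory currently has, something like the following can be used (the path is hypothetical):

# Print the configured directory-item limit as seen by the client configuration
hdfs getconf -confKey dfs.namenode.fs-limits.max-directory-items

# The 'Found N items' header of a directory listing shows its number of direct children
hdfs dfs -ls /user/alice/data | head -1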
Refer to the Apache Hadoop documentation for hdfs-default.xml for the full list of fs-limits configuration properties and their default values; they are copied here for convenience:
<property>
  <name>dfs.namenode.fs-limits.max-component-length</name>
  <value>255</value>
  <description>Defines the maximum number of bytes in UTF-8 encoding in each
    component of a path. A value of 0 will disable the check.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <value>1048576</value>
  <description>Defines the maximum number of items that a directory may
    contain. Cannot set the property to a value less than 1 or more than
    6400000.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <value>1048576</value>
  <description>Minimum block size in bytes, enforced by the Namenode at create
    time. This prevents the accidental creation of files with tiny block
    sizes (and thus many blocks), which can degrade
    performance.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.max-blocks-per-file</name>
  <value>1048576</value>
  <description>Maximum number of blocks per file, enforced by the Namenode on
    write. This prevents the creation of extremely large files which can
    degrade performance.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.max-xattrs-per-inode</name>
  <value>32</value>
  <description>
    Maximum number of extended attributes per inode.
  </description>
</property>

<property>
  <name>dfs.namenode.fs-limits.max-xattr-size</name>
  <value>16384</value>
  <description>
    The maximum combined size of the name and value of an extended attribute
    in bytes. It should be larger than 0, and less than or equal to maximum
    size hard limit which is 32768.
  </description>
</property>
All of these settings use reasonable default values as decided upon by the Apache Hadoop community. It is generally recommended that users do not tune these values except in very unusual circumstances.
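If one of these limits ever does need to be changed, it would be overridden in hdfs-site.xml on the NameNode, followed by a NameNode restart; a hedged sketch with a purely illustrative value:

<!-- hdfs-site.xml on the NameNode: hypothetical override of the directory-item limit -->
<property>
  <name>dfs.namenode.fs-limits.max-directory-items</name>
  <!-- illustrative value only; must stay between 1 and 6400000 -->
  <value>2097152</value>
</property>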
From http://blog.cloudera.com/blog/2009/02/the-small-files-problem/:
Every file, directory and block in HDFS is represented as an object in the namenode’s memory, each of which occupies 150 bytes, as a rule of thumb. So 10 million files, each using a block, would use about 3 gigabytes of memory. Scaling up much beyond this level is a problem with current hardware. Certainly a billion files is not feasible.
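That rule of thumb is easy to sanity-check: 10 million files, each with one block, is roughly 20 million namenode objects, and 20,000,000 × 150 bytes ≈ 3,000,000,000 bytes, i.e. about 3 GB of heap.

# Back-of-the-envelope check of the figure quoted above
echo $(( (10000000 + 10000000) * 150 ))   # prints 3000000000 (~3 GB)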