Suppose I have a very large number of directories (say 100,000) in my filesystem, and inside each directory there is a similar number of directories. Each directory can contain any number of files, but typically not more than a few. This structure goes down to a constant depth (say 10).
My question is: is there a difference in time complexity (for the read operation) between reading a file from this directory structure, e.g. /dir-34/dir-215/dir-345/file1 using Paths.get(), and reading a file from a simple file system like this:
/dir1
/dir2
/dir3
file1
/dir4
file2
Note: This is just a theoretical question; I only want to know whether the number of directories/files in the directory I'm trying to open a file from has any effect on the speed of the read operation.
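Concretely, the read I have in mind is nothing more elaborate than this (the path names are made up):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DeepPathRead {
    public static void main(String[] args) throws IOException {
        // Made-up path into the deep directory structure
        Path deep = Paths.get("/dir-34/dir-215/dir-345/file1");
        byte[] data = Files.readAllBytes(deep);
        System.out.println("Read " + data.length + " bytes");
    }
}
```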
Some popular filesystems use more efficient data structures than older filesystems did. ext4 has directory hashing turned on by default (as pointed out by @ninjalj), as does XFS. This means lookups within a single directory are expected to take O(1) on average (so constant time overall if your path has a fixed maximum depth), subject to the performance of the hash function itself.
Even if you have zillions of files per directory, accessing a single file is very fast -- but only if you have its full path. If you do NOT have the full path, and instead have to scan a directory for a pattern, you are faced with O(n) in the number of entries in the directory. That is further exacerbated by the small read size (32 KB) used by the default system-level directory-reading calls.
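The difference is easy to see at the API level. The sketch below (with hypothetical paths) contrasts opening a file by its full path, which is one lookup per path component, with searching its parent directory for a glob pattern, which forces enumeration of every entry:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LookupVsScan {
    public static void main(String[] args) throws IOException {
        // Full-path lookup: one hash/tree probe per path component.
        Path direct = Paths.get("/dir-34/dir-215/dir-345/file1");   // hypothetical path
        System.out.println("Exists by full path: " + Files.exists(direct));

        // Pattern search: the whole directory is enumerated, O(n) in its entries.
        Path parent = Paths.get("/dir-34/dir-215/dir-345");          // hypothetical path
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(parent, "file*")) {
            for (Path match : stream) {
                System.out.println("Found by scan: " + match);
            }
        }
    }
}
```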
(While ext4
directories can have huge numbers of files, they are limited to 64000 subdirectory entries.)
If the /path/to/file is available, performance and time complexity will still largely depend on the on-disk structures and the underlying file system implementation (e.g. in btrfs everything is a B-tree, while ext4 and XFS use H-trees for directories).

For traversing the directory structure down to the leaf node (the directory that contains the file), the average-case time complexity should be O(log N), while the worst case is O(N), where N is the number of directories in the tree. The worst case occurs when the Nth directory is created under directory N-1, directory N-1 under N-2, and so on up to the root, so the tree degenerates into a single branch. Ideally you don't have to traverse all the directories in the tree from the root if you have the full path.
Then, if your underlying FS supports directory indexes and hashing, each lookup requires another O(1) to find the file within its directory. The total is therefore O(log N) + O(1); ignoring lower-order terms, it is simply O(log N), with N as above.
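If you want to check this empirically rather than reason about it, a rough timing sketch is easy to write. The paths below are hypothetical, and a serious measurement would need warm-up runs and a cold-cache vs. warm-cache distinction:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PathDepthTiming {
    // Times a single read of the given file, in nanoseconds.
    static long timeRead(Path p) throws IOException {
        long start = System.nanoTime();
        Files.readAllBytes(p);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path deep    = Paths.get("/dir-34/dir-215/dir-345/file1"); // hypothetical deep path
        Path shallow = Paths.get("/dir3/file1");                   // hypothetical shallow path

        System.out.println("deep:    " + timeRead(deep) + " ns");
        System.out.println("shallow: " + timeRead(shallow) + " ns");
    }
}
```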