
What is the time complexity of reading a file from a Linux filesystem?

Suppose I have a very large number of directories (say 100,000) in my filesystem, and inside each directory there is a similar number of directories. Each directory can contain any number of files, but typically not more than a few. This structure goes down to a constant depth (10).

My question is: is there a difference in time complexity (for the read operation) if I read a file from this directory structure, e.g. /dir-34/dir-215/dir-345/file1 using Paths.get(), compared to reading a file from a simple file system like this:

/dir1
  /dir2
  /dir3
    file1
  /dir4
    file2

Note: this is just a theoretical question; I just want to know whether the number of directories/files in the directory I'm trying to open a file from has any effect on the speed of the read operation.
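
For concreteness, a minimal sketch of the kind of read I mean (the dir-34/dir-215/... names are just the made-up example paths from above):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;

    public class DeepRead {
        public static void main(String[] args) throws IOException {
            // The full paths are known in advance, so no directory listing is
            // needed; only the components named in each path get resolved.
            Path deep = Paths.get("/dir-34/dir-215/dir-345/file1");   // deep structure
            Path shallow = Paths.get("/dir1/dir3/file1");             // simple structure

            List<String> a = Files.readAllLines(deep);
            List<String> b = Files.readAllLines(shallow);
            System.out.println(a.size() + " vs " + b.size() + " lines");
        }
    }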

asked Dec 18 '14 by Adam Arold
2 Answers

Some popular filesystems use more efficient data structures than older filesystems did. ext4 has directory hashing turned on by default (as pointed out by @ninjalj), as does XFS. This means lookups within a single directory are expected to take O(1) on average (so effectively constant time if your path has a fixed depth). This follows from the performance of the hash function itself.

Even if you have zillions of files per directory, accessing a single file is very fast -- but only if you have the full path. If you do NOT have the full path, and instead have to look through a directory for a pattern, you are faced with O(n) in the number of entries in the directory. That's further exacerbated by the small read size (32k) used by the default system-level directory-reading calls.
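
To illustrate the difference, here's a rough sketch (using the hypothetical paths from the question) contrasting a direct open by full path with a pattern scan of the directory:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class LookupVsScan {
        public static void main(String[] args) throws IOException {
            Path dir = Paths.get("/dir-34/dir-215/dir-345"); // hypothetical directory

            // Direct lookup: the full path is known, so each component is a
            // single (hashed) lookup -- no enumeration of sibling entries.
            byte[] data = Files.readAllBytes(dir.resolve("file1"));
            System.out.println("read " + data.length + " bytes");

            // Pattern scan: every entry has to be enumerated, which is O(n)
            // in the number of entries and goes through the small buffered
            // directory reads mentioned above.
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir, "file*")) {
                for (Path p : entries) {
                    System.out.println("matched: " + p.getFileName());
                }
            }
        }
    }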

(While ext4 directories can have huge numbers of files, they are limited to 64000 subdirectory entries.)

answered by Allen Luce


If the full /path/to/file is available, resolution proceeds one path component at a time. (Note: the performance and time complexity will still largely depend on the on-disk structures and the underlying filesystem implementation. For example, in btrfs everything is a B-tree, ext4 uses H-trees (hashed directory indexes), and XFS uses B+-trees.)

Therefore, for traversing the directory structure down to the leaf node (the directory which contains the file), the average-case time complexity should be O(log N), while the worst case would be O(N), where N is the number of directories in the tree. The worst case occurs when the Nth directory is created under directory N-1, directory N-1 under N-2, and so on up to the root, forming a single chain in the tree. Ideally you don't have to traverse all the directories in the tree from the root if you have the full path.

Then, if your underlying FS supports directory indexes and hashing, each lookup within a directory requires only another O(1) to find the entry. That gives O(log N) + O(1); ignoring the lower-order term, it should be just O(log N), where N here corresponds to the depth (level) of the tree.
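
A minimal sketch of that last point (hypothetical paths, and real timings will be dominated by caching): with indexed/hashed directories, the cost of opening by full path grows with the number of components in the path, not with how many sibling entries each directory holds:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class DepthTiming {
        // Time a read by full path; only the directories named in the path
        // are touched during resolution, regardless of their sibling counts.
        static long timeRead(Path p) throws IOException {
            long start = System.nanoTime();
            Files.readAllBytes(p);
            return System.nanoTime() - start;
        }

        public static void main(String[] args) throws IOException {
            Path shallow = Paths.get("/dir1/dir3/file1");            // 3 components
            Path deep = Paths.get("/dir-34/dir-215/dir-345/file1");  // 4 components
            System.out.println("shallow: " + timeRead(shallow) + " ns");
            System.out.println("deep:    " + timeRead(deep) + " ns");
        }
    }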

answered by askb