Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How problematic is it to read many small files from one directory?

Tags:

I have to read many (up to 5 mio.) small (9 KB) files. At the moment they are all in one directory. I fear this will take quadratic time or even n^2 log n for look up, is this right? Is this significant (will the lookup take more time than the actual reading)? Is there a difference in the asymptotic behavior of the running time when the file are cached by the OS?

I use C++-streams for reading the files. At the moment I'm using Windows 7 with NTFS, but I will later run the program on a linux cluster (not sure which file system).

like image 643
alphabetagammadeltaepsilon Avatar asked Sep 12 '16 12:09

alphabetagammadeltaepsilon


1 Answers

It might not be that bad : if you enumerate files, and process each filename as you encounter it, your OS is quite likely to have the directory entry in its disk cache. And for practical purposes, a disk cache is O(1).

What will kill you is a mechanical HDD. You'll have 5 million disk seeks, each of which takes ~1/100th of a second. That is 50.000 seconds, more than a half day. This is a task that screams for an SSD.

like image 62
MSalters Avatar answered Sep 26 '22 16:09

MSalters