I have to read many small (9 KB) files, up to 5 million of them. At the moment they are all in one directory. I fear the lookups will take quadratic time in total, or even n^2 log n. Is that right? Is this significant, i.e. will the lookups take more time than the actual reading? Is there a difference in the asymptotic behavior of the running time when the files are cached by the OS?
I use C++ streams for reading the files. At the moment I'm running on Windows 7 with NTFS, but the program will later run on a Linux cluster (I'm not sure which file system).
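Roughly, each file is read like this (a minimal sketch; `read_file` and the binary-mode choice are illustrative, not my exact code):

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Read one small file fully into a string; "path" is a placeholder.
std::string read_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    return std::string{std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>()};
}
```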
It might not be that bad: if you enumerate the files and process each filename as you encounter it, your OS is quite likely to still have the directory entry in its disk cache, and for practical purposes a disk cache is O(1).
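A minimal C++17 sketch of that enumerate-and-process pattern (the directory name is a placeholder; error handling omitted):

```cpp
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>

namespace fs = std::filesystem;

int main() {
    // Hypothetical directory name; substitute your own.
    const fs::path dir = "data_dir";

    // Process each file as it is enumerated, so the directory entry
    // the OS just touched is still hot in its cache.
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (!entry.is_regular_file()) continue;
        std::ifstream in(entry.path(), std::ios::binary);
        std::string contents{std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>()};
        // ... process `contents` here ...
    }
}
```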
What will kill you is a mechanical HDD. You'll have 5 million disk seeks, each of which takes roughly 1/100th of a second. That is 50,000 seconds, more than half a day. This is a task that screams for an SSD.