NTFS directory has 100K entries. How much performance boost if spread over 100 subdirectories?

Context

We have a homegrown filesystem-backed caching library. We currently have performance problems with one installation due to a large number of entries (up to 100,000). The problem: we store all cache entries in a single "cache directory", and very large directories perform poorly.

We're looking at spreading those entries over subdirectories, as git does: e.g. 100 subdirectories with ~1,000 entries each.
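For illustration, here is a minimal sketch of the git-style layout we have in mind, assuming each cache key is hashed and the first two hex characters of the hash pick the bucket (the class and method names are hypothetical, and two hex characters give 256 buckets rather than exactly 100):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class CachePathResolver {
    private final File cacheRoot;

    public CachePathResolver(File cacheRoot) {
        this.cacheRoot = cacheRoot;
    }

    /** Maps a cache key to cacheRoot/<first 2 hex chars of hash>/<full hash>. */
    public File resolve(String key) {
        String hash = sha1Hex(key);
        File bucket = new File(cacheRoot, hash.substring(0, 2)); // 256 possible buckets
        return new File(bucket, hash);
    }

    private static String sha1Hex(String s) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(s.getBytes(StandardCharsets.UTF_8))) {
                sb.append(String.format("%02x", b)); // unsigned hex per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```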

The question

I understand that smaller directory sizes will help with filesystem access.

But will "spreading into subdirectories" speed up traversing all entries, e.g. enumerating/reading all 100,000 of them? When we initialize/warm the cache from the FS store, we need to traverse all 100,000 entries (and delete stale ones), which can take 10+ minutes.

Will "spreading the data" decrease this "traversal time". Additionally this "traversal" actually can/does delete stale entries (e.g older then N days) Will "spreading the data" improve delete times?

Additional context:

- NTFS
- Windows server OS (Server 2003, 2008)
- Java J2EE application

I/we would appreciate any schooling on filesystem scalability issues.

Thanks in advance.

will

p.s. I should comment that I have the tools and ability to test this myself, but figured I'd pick the hive mind for the theory and experience first.

asked Dec 05 '10 by user331465


1 Answer

I also believed that spreading files across subdirectories would speed up operations.

So I conducted a test: I generated files named AAAA through ZZZZ (26^4 files, about 450K) and placed them into one NTFS directory. I also placed identical files into subdirectories AA through ZZ (i.e. grouped by the first 2 letters of their names). Then I performed two kinds of tests, enumeration and random access, rebooting the system after creation and between tests.
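A rough reconstruction of that setup (paths and class name are illustrative; this is not the exact script used):

```java
import java.io.File;
import java.io.IOException;

public class TestDataGenerator {
    public static void main(String[] args) throws IOException {
        File flatRoot = new File("C:\\test\\flat");      // all 26^4 files in one directory
        File shardRoot = new File("C:\\test\\sharded");  // same files grouped by first 2 letters
        flatRoot.mkdirs();

        for (char a = 'A'; a <= 'Z'; a++) {
            for (char b = 'A'; b <= 'Z'; b++) {
                File bucket = new File(shardRoot, "" + a + b);
                bucket.mkdirs();
                for (char c = 'A'; c <= 'Z'; c++) {
                    for (char d = 'A'; d <= 'Z'; d++) {
                        String name = "" + a + b + c + d;
                        new File(flatRoot, name).createNewFile();
                        new File(bucket, name).createNewFile();
                    }
                }
            }
        }
    }
}
```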

The flat structure showed slightly better performance than the subdirectory layout. I believe this is because the directories are cached and NTFS indexes directory contents, so lookups are fast.

Note that full enumeration (in both cases) took about 3 minutes for ~450K files. That is a significant amount of time, and subdirectories make it even worse.

Conclusion: on NTFS in particular, it makes no sense to group files into subdirectories when any of those files may be accessed. If you have a cache, I would also test grouping the files by date or by domain, on the assumption that some files are accessed more frequently than others and the OS therefore doesn't need to keep all the directories in memory. However, for your number of files (under 100K) this probably wouldn't provide significant benefits either. You need to measure such specific scenarios yourself, I think.

Update: I've reduced my random-access test to touch only half of the files (those from AA to OO). The assumption was that this would involve the whole flat directory but only half of the subdirectories (giving a bonus to the subdirectory case). Still, the flat directory performed better. So I assume that unless you have millions of files, keeping them in one flat directory on NTFS will be faster than grouping them into subdirectories.
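A sketch of how such a random-access comparison can be timed (paths, sample count, and class name are illustrative only):

```java
import java.io.File;
import java.util.Random;

public class RandomAccessBenchmark {
    public static void main(String[] args) {
        File flatRoot = new File("C:\\test\\flat");
        File shardRoot = new File("C:\\test\\sharded");
        Random rnd = new Random(42);
        int samples = 10000;

        long flatNanos = 0, shardNanos = 0;
        int hits = 0;
        for (int i = 0; i < samples; i++) {
            String name = randomName(rnd);   // first letter limited to A..O, i.e. ~half the files

            long t0 = System.nanoTime();
            boolean inFlat = new File(flatRoot, name).exists();
            long t1 = System.nanoTime();
            boolean inShard = new File(new File(shardRoot, name.substring(0, 2)), name).exists();
            long t2 = System.nanoTime();

            flatNanos += t1 - t0;
            shardNanos += t2 - t1;
            if (inFlat && inShard) hits++;
        }
        System.out.printf("hits=%d, flat: %d ms, sharded: %d ms%n",
                hits, flatNanos / 1_000_000, shardNanos / 1_000_000);
    }

    private static String randomName(Random rnd) {
        char a = (char) ('A' + rnd.nextInt(15));   // A..O only
        char b = (char) ('A' + rnd.nextInt(26));
        char c = (char) ('A' + rnd.nextInt(26));
        char d = (char) ('A' + rnd.nextInt(26));
        return "" + a + b + c + d;
    }
}
```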

answered Sep 26 '22 by Eugene Mayevski 'Callback