I have about 750,000,000 files I need to store on disk. What's more is I need to be able to access these files randomly--any given file at any time--in the shortest time possible. What do I need to do to make accessing these files fastest?
Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.
A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
Any ideas?
The crux of this is finding a file. What's the fastest way to find a file by name to open?
EDIT:
EDIT2:
I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!
This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS, it's designed for high volume applications.
You may also want to consider using a relational database for this sort of thing. 750 million rows is sort of a medium size database, so any robust DBMS (eg. PostgreSQL) would be able to handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in the files on disk you can just store in the database itself.
Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, then definitely choose NTFS. Don't store too many files in a single directory, 100,000 might be an upper limit to consider (although you will have to experiment, there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much, you might consider breaking it up on every four letters or something. The best value to choose depends on the shape of your dataset.
The reason breaking up the name is a good idea is that typically the performance of filesystems decreases as the number of files in a directory increases. This depends highly on the filesystem in use, for example FAT32 will be horrible with probably only a few thousand files per directory. You don't want to break up the filenames too much, so you will minimise the number of directory lookups the filesystem will have to do.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With