
Fastest file access/storage?

Tags: file, storage

I have about 750,000,000 files I need to store on disk. What's more, I need to be able to access these files randomly (any given file at any time) in the shortest time possible. What do I need to do to make accessing these files fastest?

Think of it like a hash table, only the hash keys are the filenames and the associated values are the files' data.

A coworker said to organize them into directories like this: if I want to store a file named "foobar.txt" and it's stored on the D: drive, put the file in "D:\f\o\o\b\a\r.\t\x\t". He couldn't explain why this was a good idea though. Is there anything to this idea?
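For concreteness, here is a rough sketch of one reading of that layout, where every character of the filename becomes its own path component (the helper name is made up, and the handling of the "." differs slightly from the path quoted above):

```python
import ntpath  # Windows-style paths, since the files live on a D: drive

def per_character_path(drive, filename):
    """Every character of the filename becomes its own path component,
    so 'foobar.txt' ends up at D:\\f\\o\\o\\b\\a\\r\\.\\t\\x\\t
    (the last component is the file itself)."""
    return ntpath.join(drive + "\\", *list(filename))

print(per_character_path("D:", "foobar.txt"))
# D:\f\o\o\b\a\r\.\t\x\t
```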

Any ideas?

The crux of this is finding a file. What's the fastest way to find a file by name to open?

EDIT:

  • I have no control over the file system upon which this data is stored. It's going to be NTFS or FAT32.
  • Storing the file data in a database is not an option.
  • Files are going to be very small: a maximum of probably 1 KB each.
  • The drives are going to be solid state.
  • Data access is virtually random, but I could probably figure out a priority for each file based on how often it is requested. Some files will be accessed much more than others.
  • Items will constantly be added, and sometimes deleted.
  • It would be impractical to consolidate multiple files into single files because there's no logical association between files.
  • I would love to gather some metrics by running tests on this stuff, but that endeavour could become as consuming as the project itself!
EDIT2:

I want to upvote several thorough answers, whether they're spot-on or not, and cannot because of my newbie status. Sorry guys!

asked Nov 07 '09 by JamesBrownIsDead


1 Answer

This sounds like it's going to be largely a question of filesystem choice. One option to look at might be ZFS; it's designed for high-volume applications.

You may also want to consider using a relational database for this sort of thing. 750 million rows is a medium-sized database as these things go, so any robust DBMS (e.g. PostgreSQL) would handle it well. You can store arbitrary blobs in the database too, so whatever you were going to store in files on disk you can store in the database itself.
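As a rough illustration of the blob-in-a-database idea, here is a minimal sketch assuming PostgreSQL with the psycopg2 driver; the connection string, table name, and columns are invented for the example:

```python
import psycopg2

# Connection parameters are placeholders -- adjust to your environment.
conn = psycopg2.connect("dbname=filestore user=app")

with conn, conn.cursor() as cur:
    # One row per "file": the filename is the key, the contents a bytea blob.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS files (
            name text PRIMARY KEY,
            data bytea NOT NULL
        )
    """)

    # Store a small file's contents.
    cur.execute(
        "INSERT INTO files (name, data) VALUES (%s, %s)",
        ("foobar.txt", psycopg2.Binary(b"hello world")),
    )

    # Random access by name: a single primary-key index lookup.
    cur.execute("SELECT data FROM files WHERE name = %s", ("foobar.txt",))
    data = bytes(cur.fetchone()[0])
```

With 750 million rows the primary-key B-tree stays only a handful of levels deep, so each lookup by name costs a few page reads.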

Update: Your additional information is certainly helpful. Given a choice between FAT32 and NTFS, definitely choose NTFS. Don't store too many files in a single directory; 100,000 might be an upper limit to consider (although you will have to experiment, as there's no hard and fast rule). Your friend's suggestion of a new directory for every letter is probably too much; you might consider breaking the name up every four letters or so instead. The best value to choose depends on the shape of your dataset.
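For example, here is a sketch of that kind of "every few letters" layout, assuming four-character chunks and two directory levels (both numbers are placeholders to tune against your data, and the helper names are made up):

```python
import os

def shard_path(root, filename, chunk=4, depth=2):
    """Split the filename into fixed-size chunks and use them as directory
    levels, e.g. 'foobar.txt' -> <root>/foob/ar.t/foobar.txt."""
    parts = [filename[i:i + chunk]
             for i in range(0, min(len(filename), chunk * depth), chunk)]
    return os.path.join(root, *parts, filename)

def store(root, filename, data):
    path = shard_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)  # create shard dirs lazily
    with open(path, "wb") as f:
        f.write(data)

def load(root, filename):
    with open(shard_path(root, filename), "rb") as f:
        return f.read()
```

If real filenames share long common prefixes, hashing the name first (for instance, using the leading hex digits of an MD5 of the name as the chunks) spreads the files more evenly; with two levels of 256 buckets each, 750 million files works out to roughly 11,000 per leaf directory.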

The reason breaking up the name is a good idea is that the performance of filesystems typically decreases as the number of files in a directory increases. This depends heavily on the filesystem in use; FAT32, for example, will be horrible with probably only a few thousand files per directory. At the same time, you don't want to break the filenames up too much, so as to minimise the number of directory lookups the filesystem has to do.

answered Sep 17 '22 by Greg Hewgill