
Best way to store many files in disk

I couldn't find a good title for the question, this is what I'm trying to do:

  • This is a .NET application.
  • I need to store up to 200,000 objects (between 3 KB and 500 KB each).
  • I need to store about 10 of them per second, from multiple threads.
  • I use binary serialization before storing them.
  • I need to access them later by an integer, unique ID.

What's the best way to do this?

  • I can't keep them in memory, as I'll get OutOfMemory exceptions.
  • If I store them on disk as separate files, what are the possible performance issues? Would it decrease the overall performance much?
  • Should I implement some sort of batching, for example combine 100 objects and write them out as one file, then parse them later? Or something similar?
  • Should I use a database? (Access time is not important, there won't be any searching, and I'll access each object only a couple of times by its known unique ID.) In theory I don't need a database, and I don't want to complicate this.

UPDATE:

  • I assume a database would be slower than the file system; prove me wrong if you have evidence to the contrary. That's why I'm also leaning towards the file system. What I'm really worried about is writing 200 KB * 10 per second (roughly 2 MB/s) to the HDD (this can be any HDD; I don't control the hardware, it's a desktop tool which will be deployed on different systems).
  • If I use the file system, I'll store files in separate folders to avoid file-system related issues (so you can ignore that limitation).
asked Feb 09 '10 by dr. evil

3 Answers

If you want to avoid using a database, you can store them as files on disk (to keep things simple). But you need to be aware of filesystem considerations when maintaining a large number of files in a single directory.

Many common filesystems maintain the files in a directory as some kind of sequential list (e.g., simply storing file pointers or inodes one after the other, or in linked lists). This makes opening files located near the bottom of the list really slow.

A good solution is to limit each directory to a small number of entries (say n = 1000) and create a tree of files under the root directory.

So instead of storing files as:

/dir/file1 /dir/file2 /dir/file3 ... /dir/fileN

Store them as:

/dir/r1/s2/file1 /dir/r1/s2/file2 ... /dir/rM/sN/fileP

By splitting up your files this way, you improve access time significantly across most file systems.
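
For illustration, here is a minimal C# sketch of one way to map the integer ID onto such a tree; the GetPathForId helper, the .bin extension, and the bucket sizes are my own assumptions, not part of the answer:

using System.IO;

static class FileTreeStore
{
    // e.g. id 1234567 -> <root>/001/234/1234567.bin
    // Two bucket levels of at most 1000 entries each keep directories small.
    public static string GetPathForId(string root, int id)
    {
        int level1 = id / 1000000;          // top-level bucket (assumption)
        int level2 = (id / 1000) % 1000;    // second-level bucket (assumption)
        string dir = Path.Combine(root, level1.ToString("D3"), level2.ToString("D3"));
        Directory.CreateDirectory(dir);     // no-op if the directory already exists
        return Path.Combine(dir, id + ".bin");
    }

    public static void Save(string root, int id, byte[] serializedObject)
    {
        File.WriteAllBytes(GetPathForId(root, id), serializedObject);
    }

    public static byte[] Load(string root, int id)
    {
        return File.ReadAllBytes(GetPathForId(root, id));
    }
}

With sequential IDs up to 200,000, this layout never puts more than 1000 files in any single directory.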

(Note that some newer filesystems represent directory entries as trees or other forms of index; this technique works on those too.)

Other considerations are tuning your filesystem (block sizes, partitioning etc.) and your buffer cache such that you get good locality of data. Depending on your OS and filesystem, there are many ways to do this - you'll probably need to look them up.

Alternatively, if this doesn't cut it, you can use some kind of embedded database like SQLite or Firebird.

HTH.

answered by 0xfe

I would be tempted to use a database - in C++, either SQLite or CouchDB.
These would both work in .NET, but I don't know if there is a better .NET-specific alternative.

Even on filesystems that can handle 200,000 files in a directory, it will take forever to open the directory.

Edit - the DB will probably be faster!
The filesystem isn't designed for huge numbers of small objects; the DB is.
It will implement all sorts of clever caching/transaction strategies that you never thought of.

There are photo sites that chose the filesystem over a DB, but they are mostly doing reads of rather large blobs, and they have lots of admins who are experts at tuning their servers for this specific application.
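
To make that concrete, here is a minimal sketch of storing the serialized blobs in an embedded SQLite database from C#; it assumes the Microsoft.Data.Sqlite package, and the class and table names are illustrative, not from the answer:

using Microsoft.Data.Sqlite;

class SqliteBlobStore
{
    private readonly string _connectionString;

    public SqliteBlobStore(string path)
    {
        _connectionString = "Data Source=" + path;
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                // One row per object, keyed by the integer ID.
                cmd.CommandText =
                    "CREATE TABLE IF NOT EXISTS objects (id INTEGER PRIMARY KEY, data BLOB NOT NULL)";
                cmd.ExecuteNonQuery();
            }
        }
    }

    public void Save(int id, byte[] data)
    {
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "INSERT OR REPLACE INTO objects (id, data) VALUES ($id, $data)";
                cmd.Parameters.AddWithValue("$id", id);
                cmd.Parameters.AddWithValue("$data", data);
                cmd.ExecuteNonQuery();
            }
        }
    }

    public byte[] Load(int id)
    {
        using (var conn = new SqliteConnection(_connectionString))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText = "SELECT data FROM objects WHERE id = $id";
                cmd.Parameters.AddWithValue("$id", id);
                return (byte[])cmd.ExecuteScalar();
            }
        }
    }
}

The whole store stays in a single file, and SQLite handles the caching and transactional writes mentioned above.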

answered by Martin Beckett

I recommend making a class with a single-threaded queue that appends the (gzipped) blobs to the end of a pack file and saves the file offsets/meta-info in a small database like SQLite. This lets you store all of your files quickly and compactly from multiple threads, and read them back efficiently, without having to deal with any filesystem quirks (other than the maximum file size, which can be handled with some extra metadata).

File:
file.1.gzipack

Table:
compressed_files {
  id,
  storage_file_id,
  storage_offset,
  storage_compressed_length,
  mime_type,
  original_file_name
}
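
As a rough illustration of this idea, here is a C# sketch that gzips each blob and appends it to a pack file under a lock, recording the offset and length; the in-memory dictionary stands in for the compressed_files table above, and all the names are mine, not the answerer's:

using System;
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

class PackFileStore : IDisposable
{
    private readonly FileStream _pack;
    private readonly object _writeLock = new object();
    // In the answer's design these offsets would be rows in compressed_files.
    private readonly Dictionary<int, Tuple<long, int>> _index =
        new Dictionary<int, Tuple<long, int>>();

    public PackFileStore(string path)
    {
        _pack = new FileStream(path, FileMode.Append, FileAccess.Write, FileShare.Read);
    }

    // Safe to call from many threads; the lock serializes the appends.
    public void Append(int id, byte[] serializedObject)
    {
        byte[] compressed = Compress(serializedObject);
        lock (_writeLock)
        {
            long offset = _pack.Position;
            _pack.Write(compressed, 0, compressed.Length);
            _pack.Flush();
            _index[id] = Tuple.Create(offset, compressed.Length);
        }
    }

    private static byte[] Compress(byte[] data)
    {
        using (var buffer = new MemoryStream())
        {
            // The GZipStream must be closed before reading the buffer back.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(data, 0, data.Length);
            }
            return buffer.ToArray();
        }
    }

    public void Dispose()
    {
        _pack.Dispose();
    }
}

At 10 objects per second the single lock is nowhere near a bottleneck, and reading an object back only needs its stored offset/length plus a seek on a separate read handle.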
answered by Nthalk