I couldn't find a good title for the question; this is what I'm trying to do:
What's the best way to do this?
UPDATE:
If you want to avoid using a database, you can store them as files on disk (to keep things simple). But you need to be aware of filesystem considerations when maintaining a large number of files in a single directory.
A lot of common filesystems keep the entries of a directory in some kind of sequential list (e.g., file pointers or inodes stored one after the other, or in a linked list). This makes opening files that sit near the end of the list really slow.
A good solution is to limit each directory to a small number of entries (say n = 1000) and create a tree of subdirectories under it.
So instead of storing files as:
/dir/file1
/dir/file2
/dir/file3
...
/dir/fileN
Store them as:
/dir/r1/s2/file1
/dir/r1/s2/file2
...
/dir/rM/sN/fileP
By splitting up your files this way, you improve access time significantly across most file systems.
(Note that some newer filesystems index directory entries with trees or other structures; this technique works well on those too.)
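For example, something along these lines - a rough Python sketch, where the hash-based split, the fanout of 1000, and names like sharded_path are just my own illustration of the idea, not part of any particular scheme:

import hashlib
import os

def sharded_path(root, filename, fanout=1000):
    # Derive two directory levels from a hash of the file name, keeping
    # each directory to roughly `fanout` entries.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    level1 = int(digest[:8], 16) % fanout
    level2 = int(digest[8:16], 16) % fanout
    return os.path.join(root, f"r{level1}", f"s{level2}", filename)

def store(root, filename, data):
    path = sharded_path(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)

# store("/dir", "file1", b"...image bytes...")

The lookup side is symmetric: recompute the hash from the file name and you know exactly which subdirectory to open.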
Other considerations are tuning your filesystem (block sizes, partitioning etc.) and your buffer cache such that you get good locality of data. Depending on your OS and filesystem, there are many ways to do this - you'll probably need to look them up.
Alternatively, if this doesn't cut it, you can use some kind of embedded database like SQLite or Firebird.
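If you go the embedded-database route, the simplest version is to store each image as a BLOB. A minimal sketch using Python's sqlite3 module (the table and column names here are my own choice):

import sqlite3

conn = sqlite3.connect("images.db")
conn.execute("""CREATE TABLE IF NOT EXISTS images (
                    name TEXT PRIMARY KEY,
                    data BLOB NOT NULL)""")

def put_image(name, data):
    # `with conn:` wraps the insert in a transaction and commits it.
    with conn:
        conn.execute("INSERT OR REPLACE INTO images (name, data) VALUES (?, ?)",
                     (name, sqlite3.Binary(data)))

def get_image(name):
    row = conn.execute("SELECT data FROM images WHERE name = ?",
                       (name,)).fetchone()
    return row[0] if row else None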
HTH.
I would be tempted to use a database; in C++, either SQLite or CouchDB.
Both would work in .NET, but I don't know whether there is a better .NET-specific alternative.
Even on filesystems that can handle 200,000 files in a directory, it will take forever to open the directory.
Edit - The DB will probably be faster!
The filesystem isn't designed for huge numbers of small objects; the DB is.
It will implement all sorts of clever caching/transaction strategies that you never thought of.
There are photo sites that chose the filesystem over a DB, but they are mostly doing reads of rather large blobs, and they have lots of admins who are experts in tuning their servers for that specific workload.
I recommend making a class with a single-threaded queue that dumps images (gzipped) onto the end of a pack file, and saving the file offsets and metadata in a small database like SQLite. This lets you store all of your files quickly and compactly from multiple threads, and read them back efficiently, without having to deal with any filesystem quirks (other than the maximum file size, which can be handled with some extra metadata).
File:
file.1.gzipack
Table:
compressed_files {
  id,
  storage_file_id,
  storage_offset,
  storage_compressed_length,
  mime_type,
  original_file_name
}
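A rough Python sketch of the idea (not a finished implementation; the PackStore class and all of its names are assumptions of mine): a single writer thread drains a queue, gzips each image, appends it to the pack file, and records the offset and length in the SQLite index.

import gzip
import queue
import sqlite3
import threading

class PackStore:
    def __init__(self, pack_path="file.1.gzipack", db_path="index.db"):
        self.pack_path = pack_path
        self.db = sqlite3.connect(db_path, check_same_thread=False)
        self.db_lock = threading.Lock()  # serialize access to the shared connection
        self.db.execute("""CREATE TABLE IF NOT EXISTS compressed_files (
                               id INTEGER PRIMARY KEY,
                               storage_file_id TEXT,
                               storage_offset INTEGER,
                               storage_compressed_length INTEGER,
                               mime_type TEXT,
                               original_file_name TEXT)""")
        self.jobs = queue.Queue()
        self.writer = threading.Thread(target=self._drain, daemon=True)
        self.writer.start()

    def save(self, original_name, mime_type, data):
        # Callable from any thread; only the writer thread touches the pack file.
        self.jobs.put((original_name, mime_type, data))

    def _drain(self):
        with open(self.pack_path, "ab") as pack:
            while True:
                original_name, mime_type, data = self.jobs.get()
                blob = gzip.compress(data)
                offset = pack.tell()
                pack.write(blob)
                pack.flush()
                with self.db_lock, self.db:
                    self.db.execute(
                        "INSERT INTO compressed_files "
                        "(storage_file_id, storage_offset, storage_compressed_length, "
                        " mime_type, original_file_name) VALUES (?, ?, ?, ?, ?)",
                        (self.pack_path, offset, len(blob), mime_type, original_name))

    def load(self, original_name):
        # Look up the offset/length in the index, then seek straight to the blob.
        with self.db_lock:
            row = self.db.execute(
                "SELECT storage_offset, storage_compressed_length "
                "FROM compressed_files WHERE original_file_name = ?",
                (original_name,)).fetchone()
        if row is None:
            return None
        offset, length = row
        with open(self.pack_path, "rb") as pack:
            pack.seek(offset)
            return gzip.decompress(pack.read(length))

Growing past the maximum file size just means starting a new pack file and recording its name in storage_file_id, which is why that column is in the schema.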