If i have a site where users can upload as many images as they want(think photobucket-like), what is the best way to set up file storage (also, all uploads get a unique random timestamp)?
site root
--username
----image1.jpg
----image2.jpg
----image3.jpg
--anotheruser
----image1.jpg
----image2.jpg
----image3.jpg
...
or
siteroot
--uploads
----image1.jpg
----image2.jpg
----image3.jpg
----image4.jpg
----image6.jpg
...
----image50000.jpg
I think the first method is more organized. But i think the second method is standard(keeping all uploads in the same dir), but i wonder if it would be slower when retrieving an image if there are thousands of image in the same directory
--- edit ---
Thanks for the great answers so far. Also, i will be creating thumbnails, so i also would have to insert that directory somewhere... or, create a naming convention such as thumb_whatever.jpg.
so many different ways to do this. Yes disk space will be a problem. but for now i am concerned with retrieval time. When i have to output an image to the browser, if that image is in a directory with 10,000 other images, i am worried on how slow that could get.
You can put 4,294,967,295 files into a single folder if drive is formatted with NTFS (would be unusual if it were not) as long as you do not exceed 256 terabytes (single file size and space) or all of disk space that was available whichever is less.
A scale-out file server is frequently the choice for saving image data as primary storage. Meanwhile, object storage, the public cloud and tape drives are all appropriate targets for data protection and archiving.
Generally databases are best for data and the file system is best for files. It depends what you're planning to do with the image though. If you're storing images for a web page then it's best to store them as a file on the server. The web server will very quickly find an image file and send it to a visitor.
The number of files in a directory should have no effect at all on the time required to read a file's data - but it can massively affect the amount of time needed to find the file before you can start to read it.
The exact breakpoints where the major issues start up will vary from filesystem type to filesystem type, but, in general, if you're talking about a few hundred files, you don't much need to worry about it. If you're talking about a few thousand, it's worth thinking about and maybe doing a little benchmarking to see how your filesystem and hardware handle it. If you're talking about tens of thousands of files, then you really need to start breaking things up. (I once had a Linux/e2fs print server where CUPS wasn't deleting its job control files after it finished printing and it got up around 100,000 files in one directory. Just getting a directory listing took over half an hour before it even started to display any filenames.)
Separating them by user name may not be the best choice, though, since you'll likely have a lot of users uploading very few images and perhaps a couple who upload hundreds or thousands of images, potentially creating access time issues in those users' storage directories. The bigger problem in that scenario is that you'd likely end up (assuming a successful site) with thousands or tens of thousands of users and a large number of subdirectories is just as bad as a large number of files for slowing down access to your data.
Since you're going to have a timestamp on them, what I would probably do is put them into subdirectories based on the last three digits of the timestamp. That will distribute the files relatively evenly across 1000 subdirectories and should keep the number of files in each directory reasonably small. (Using the first three digits would cause one directory to be filled before moving to the next instead of distributing them evenly.) If you're still ending up with too many files in each subdirectory (which would likely mean you're dealing with several million uploaded images), you could add a second level for the previous three digits, so upload-1234567890.jpg would end up at /567/890/upload-1234567890.jpg.
The answer to that is "maybe". It's possible the file retrieval may be fine, but if you need to do any maintenance on the folder, it would be a huge headache as processes attempt ot enumerate the directory listings.
What would improve the situation would be a number of sub directories under the images folder (or two levels, depending on how many images you're looking at storing), so you have a hierarchy like this:
siteroot
-- uploads
---- a
---- b
---- c
:
---- z
...and then store files based on their first letter (so all images with names starting 'a' go into the folder 'a'). You could have this as a two or three letter suffix (aa, ab, ac, ad ..., ba, bb, bc ..., zx, zy, zz) and possibly have a hierarchy under that as well so you split files across a number of folders dependent on the first four characters of the name.
If files are then assigned a random alpha-numeric name then this would ensure files are spread evenly across all the folders (given a large enough sample size).
You might want to consider a mix of your option (1) and splitting images over a hierarchy as I've described above. That would ensure that if a single user does upload lots of files, then you're covered. Similarly, if you're looking at a lot of user directories, the same principle applies to ensure you don't have 1,000,000 user directories under a single parent.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With