When a user uploads an image to my site, the image goes through this process:
So far the site is pretty small, and there are only ~200,000 images in the uploads directory. I realise I'm nowhere near the physical limit of files within a directory, but this approach clearly won't scale, so I was wondering if anyone had any advice on upload / storage strategies for handling large volumes of image uploads.
EDIT: Creating username (or more specifically, userid) subfolders would seem to be a good solution. With a bit more digging, I've found some great info right here: How to store images in your filesystem
However, would this userid dir approach scale well if a CDN is brought into the equation?
Store each image as a file in the file system and create a record in a table with the exact path to that image. Or, store the image itself in a table, using an "image" or "binary data" data type of the database server.
Storing images in a database table is not recommended. There are too many disadvantages to this approach. Storing the image data in the table requires the database server to process and move around huge amounts of binary data, using resources that would be better spent on the query processing it is best suited for.
Generally, databases are best for data and the file system is best for files. It depends on what you're planning to do with the images, though. If you're storing images for a web page, then it's best to store them as files on the server. The web server will very quickly find an image file and send it to a visitor.
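To make the first option concrete, here's a minimal sketch of the "file on disk, path in the database" approach, using Python's standard-library sqlite3 module. The table and column names are my own illustration, not anything prescribed above:

    import sqlite3

    conn = sqlite3.connect("site.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS images (
            id       INTEGER PRIMARY KEY,
            user_id  INTEGER NOT NULL,
            path     TEXT NOT NULL,    -- exact filesystem path to the image
            uploaded TEXT NOT NULL     -- ISO-8601 upload timestamp
        )""")

    def register_upload(user_id, path, uploaded):
        # Store only the path; the image bytes stay on disk, where the
        # web server can serve them directly.
        conn.execute(
            "INSERT INTO images (user_id, path, uploaded) VALUES (?, ?, ?)",
            (user_id, path, uploaded))
        conn.commit()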
I've answered a similar question before but I can't find it, maybe the OP deleted his question...
Anyway, Adam's solution seems to be the best so far, yet it isn't bulletproof, since images/c/cf/ (or any other dir/subdir pair) could still contain up to 16^30 unique hashes, and at least 3 times more files if we count image extensions: a lot more than any regular file system can handle.
AFAIK, SourceForge.net also uses this system for project repositories; for instance, the "fatfree" project would be placed at projects/f/fa/fatfree/. However, I believe they limit project names to 8 chars.
I would store the image hash in the database along with a DATE / DATETIME / TIMESTAMP field indicating when the image was uploaded/processed, and then place the image in a structure like this:
    images/
        2010/                                     - Year
            04/                                   - Month
                19/                               - Day
                    231c2ee287d639adda1cdb44c189ae93.png  - Image Hash
Or:
    images/
        2010/                                     - Year
            0419/                                 - Month & Day (12 * 31 = 372)
                231c2ee287d639adda1cdb44c189ae93.png  - Image Hash
Besides being more descriptive, this structure is enough to host hundreds of thousands of images per day (depending on your file system limits) for several thousand years. This is the way WordPress and others do it, and I think they got it right on this one.
Duplicate images could easily be found by querying the database for the hash, and then you'd just have to create symlinks.
Of course, if this is not enough for you, you can always add more subdirs (hours, minutes, ...).
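Here's a minimal sketch of this layout in Python, assuming MD5 hashes; find_existing_path() is a hypothetical stand-in for the database lookup mentioned above, not a real API:

    import hashlib
    import os
    from datetime import date

    def find_existing_path(digest):
        # Hypothetical stand-in for the database query: return the stored
        # path for this hash if the image was seen before, else None.
        return None

    def store_image(data, ext, root="images"):
        digest = hashlib.md5(data).hexdigest()
        today = date.today()
        # Year/Month/Day directories, as in the first structure above.
        dirpath = os.path.join(root, f"{today:%Y}", f"{today:%m}", f"{today:%d}")
        os.makedirs(dirpath, exist_ok=True)
        path = os.path.join(dirpath, f"{digest}.{ext}")

        existing = find_existing_path(digest)
        if existing and existing != path:
            # Duplicate upload: link to the original instead of copying.
            os.symlink(os.path.abspath(existing), path)
        else:
            with open(path, "wb") as fh:
                fh.write(data)
        return path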
Personally, I wouldn't use user IDs unless you don't have the upload date available in your database.
Regarding the CDN, I don't see any reason this scheme (or any other) wouldn't work...
MediaWiki generates the MD5 sum of the name of the uploaded file, and uses the first letter ("c") and the first two letters ("cf") of the sum "cf1e66b77918167a6b6b972c12b1c00d" to create this directory structure:
images/c/cf/Whatever_filename.png
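A sketch of that scheme in Python; this is my own illustration of the idea, not MediaWiki's actual code:

    import hashlib

    def mediawiki_style_path(filename, root="images"):
        # Hash the *name* of the uploaded file, then use the first one and
        # two hex characters as nested directories.
        digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
        return f"{root}/{digest[0]}/{digest[:2]}/{filename}"

    # For a name whose sum starts with "cf", this yields e.g.
    # images/c/cf/Whatever_filename.png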
You could also use the image ID for a predictable upper limit on the number of files per directory. Maybe take floor(image unique ID / 1000) to determine the parent directory, for 1000 images per directory.
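A quick sketch of that bucketing (function and parameter names are illustrative):

    import os

    def bucket_path(image_id, ext, root="images"):
        bucket = image_id // 1000  # floor(image unique ID / 1000)
        return os.path.join(root, str(bucket), f"{image_id}.{ext}")

    # IDs 0-999 land in images/0/, 1000-1999 in images/1/, and so on,
    # capping each directory at 1000 files.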