Scenario
Users can post an item and include up to 5 images with the post. Each uploaded image needs to be resampled and resized, which creates 4 extra images per upload. That means if a user uploads 5 images, we end up with 25 images total to store.
Assumptions
Possible approaches
Does anyone have experience with best practices / approaches for storing images scalably?
Note: I presume someone will mention S3 - let's assume we want to keep the images locally for the time being.
Thanks for looking
Store each image as a file in the file system and create a record in a table with the exact path to that image. Alternatively, store the image itself in a table using an "image" or "binary data" data type of the database server.
Generally, databases are best for data and the file system is best for files. It depends on what you're planning to do with the images, though. If you're serving images for a web page, it's best to store them as files on the server: the web server will very quickly find an image file and send it to a visitor.
Keep in mind that inserting images into the database generally has to be done programmatically, by passing the binary data through a parameterized query rather than writing it into a literal SQL statement.
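Here is a minimal sketch of both approaches using Python's built-in sqlite3 module (the table names, column names, and paths are illustrative, not from the question):

    import os
    import sqlite3

    conn = sqlite3.connect("images.db")
    image_bytes = open("photo.jpg", "rb").read()  # the uploaded image, assumed to exist

    # Approach 1: write the bytes to disk and record only the path.
    os.makedirs("uploads", exist_ok=True)
    with open("uploads/1234.jpg", "wb") as f:
        f.write(image_bytes)
    conn.execute("CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
    conn.execute("INSERT INTO images (path) VALUES (?)", ("uploads/1234.jpg",))

    # Approach 2: store the binary data itself in a BLOB column. Note the
    # parameterized query -- the bytes are passed in programmatically, not
    # embedded in the SQL text.
    conn.execute("CREATE TABLE IF NOT EXISTS image_blobs (id INTEGER PRIMARY KEY, data BLOB)")
    conn.execute("INSERT INTO image_blobs (data) VALUES (?)", (image_bytes,))
    conn.commit()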
We have such a system in heavy production with 30,000+ files and 20+ GB to date...
   Column    |            Type             |                         Modifiers
-------------+-----------------------------+-----------------------------------------------------------
 File_ID     | integer                     | not null default nextval('"ACRM"."File_pseq"'::regclass)
 CreateDate  | timestamp(6) with time zone | not null default now()
 FileName    | character varying(255)      | not null default NULL::character varying
 ContentType | character varying(128)      | not null default NULL::character varying
 Size        | integer                     | not null
 Hash        | character varying(40)       | not null
Indexes:
    "File_pkey" PRIMARY KEY, btree ("File_ID")
The files are just stored in a single directory with the integer File_ID as the name of the file. We're over 30,000 files with no problems, and I've tested with considerably more.
This is using RHEL 5 x86_64 with ext3 as the file system.
Would I do it this way again? No. Let me share a couple thoughts on a redesign.
The database is still the "master source" of information on the files.
Each file is sha1() hashed and stored in a filesystem hierarchy based on that hash:
/FileData/ab/cd/abcd4548293827394723984723432987.jpg
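A rough sketch of that layout in Python (the function name and the deduplication check are my own; only the two-level /FileData/ab/cd/ structure comes from the example above):

    import hashlib
    import shutil
    from pathlib import Path

    def store_blob(src: Path, root: Path = Path("/FileData")) -> Path:
        """Copy a file into a two-level directory hierarchy keyed by its sha1."""
        digest = hashlib.sha1(src.read_bytes()).hexdigest()
        dest = root / digest[:2] / digest[2:4] / (digest + src.suffix)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():  # identical content is stored exactly once
            shutil.copy2(src, dest)
        return dest

Keeping the original extension on the hashed name matters for the content-type point further down.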
The database is a bit smarter about storing meta-information on each file. It would be a three-table system (a rough schema sketch follows the list):

File: stores info such as name, date, IP, owner, and a pointer to a Blob (sha1)
File_Meta: stores key/value pairs on the file, depending on the type of file; this may include information such as Image_Width, etc.
Blob: stores a reference to the sha1 along with its size.
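A rough translation of that three-table layout into DDL (SQLite syntax driven from Python; the exact column names are my own reading of the description above):

    import sqlite3

    conn = sqlite3.connect("files.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS Blob (
        sha1 TEXT PRIMARY KEY,   -- hash of the file content
        size INTEGER NOT NULL
    );
    CREATE TABLE IF NOT EXISTS File (
        file_id   INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        created   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ip        TEXT,
        owner     TEXT,
        blob_sha1 TEXT NOT NULL REFERENCES Blob(sha1)
    );
    CREATE TABLE IF NOT EXISTS File_Meta (
        file_id INTEGER NOT NULL REFERENCES File(file_id),
        key     TEXT NOT NULL,   -- e.g. 'Image_Width'
        value   TEXT,
        PRIMARY KEY (file_id, key)
    );
    """)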
This system would de-duplicate the file content by storing the data under its hash, so multiple files could reference the same file data. It would also be very easy to back up or sync the file store using rsync.
Also, the limitations of a given directory containing a lot of files would be eliminated.
The file extension would be stored as part of the unique file name, appended to the hash. For example, if the hash for an empty file were abcd8765, an empty .txt file and an empty .php file would refer to the same hash. Rather than colliding, they should be stored as abcd8765.php and abcd8765.txt. Why?
Apache, etc. can be configured to automatically choose the content type and caching rules based on the file extension, so it is important to store the files with a valid name and an extension that reflects the content of the file.
This system could also really boost performance by delegating file delivery to nginx. See http://wiki.nginx.org/XSendfile.
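A sketch of how the application side could look, assuming a Flask app in front of nginx (X-Accel-Redirect is nginx's variant of X-Sendfile, and the dict here is a hypothetical stand-in for the database lookup):

    from flask import Flask, Response, abort

    app = Flask(__name__)

    # Hypothetical stand-in for the query that maps a File_ID to its
    # hash-based path and content type.
    FILES = {1: ("/FileData/ab/cd/abcd4548293827394723984723432987.jpg", "image/jpeg")}

    @app.route("/file/<int:file_id>")
    def serve_file(file_id):
        entry = FILES.get(file_id)
        if entry is None:
            abort(404)
        path, content_type = entry
        # The body stays empty: nginx intercepts X-Accel-Redirect and
        # streams the file itself from its internal /FileData/ location.
        return Response(headers={"X-Accel-Redirect": path, "Content-Type": content_type})

On the nginx side, the /FileData/ location would be marked internal so that clients can only reach the files through this header.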
I hope this helps in some way. Take care.