Scenario
Users can post an item and include up to 5 images with the post. Each uploaded image needs to be resampled and resized, which creates 4 extra images per upload. That means if a user uploads 5 images, we end up with 25 images total to store.
Assumptions
Possible approaches
Does anyone have experience with best practices / approaches for storing images scalably?
Note: I presume someone will mention S3 - let's assume we want to keep the images locally for the time being.
Thanks for looking
Store each image as a file in the file system and create a record in a table with the exact path to that image. Alternatively, store the image itself in a table using an "image" or "binary data" data type of the database server.
Generally, databases are best for data and the file system is best for files. It depends on what you're planning to do with the images, though. If you're serving images for a web page, it's best to store them as files on the server: the web server will very quickly find an image file and send it to a visitor.
Keep in mind that inserting images into the database generally has to be done programmatically, by passing the binary data through a parameterized query rather than writing it into a literal SQL statement.
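Here is a minimal sketch of both approaches using Python's built-in sqlite3 module (the table names, column names, and paths are illustrative, not from the question):

    import os
    import sqlite3

    conn = sqlite3.connect("images.db")
    image_bytes = open("photo.jpg", "rb").read()  # the uploaded image, assumed to exist

    # Approach 1: write the bytes to disk and record only the path.
    os.makedirs("uploads", exist_ok=True)
    with open("uploads/1234.jpg", "wb") as f:
        f.write(image_bytes)
    conn.execute("CREATE TABLE IF NOT EXISTS images (id INTEGER PRIMARY KEY, path TEXT NOT NULL)")
    conn.execute("INSERT INTO images (path) VALUES (?)", ("uploads/1234.jpg",))

    # Approach 2: store the binary data itself in a BLOB column. Note the
    # parameterized query -- the bytes are passed in programmatically, not
    # embedded in the SQL text.
    conn.execute("CREATE TABLE IF NOT EXISTS image_blobs (id INTEGER PRIMARY KEY, data BLOB)")
    conn.execute("INSERT INTO image_blobs (data) VALUES (?)", (image_bytes,))
    conn.commit()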
We have such a system in heavy production with 30,000+ files and 20+ GB to date...
   Column    |            Type             |                         Modifiers
-------------+-----------------------------+-----------------------------------------------------------
 File_ID     | integer                     | not null default nextval('"ACRM"."File_pseq"'::regclass)
 CreateDate  | timestamp(6) with time zone | not null default now()
 FileName    | character varying(255)      | not null default NULL::character varying
 ContentType | character varying(128)      | not null default NULL::character varying
 Size        | integer                     | not null
 Hash        | character varying(40)       | not null
Indexes:
    "File_pkey" PRIMARY KEY, btree ("File_ID")
The files are just stored in a single directory with the integer File_ID as the name of the file. We're over 30,000 files with no problems, and I've tested with considerably more.
This is using RHEL 5 x86_64 with ext3 as the file system.
Would I do it this way again? No. Let me share a couple thoughts on a redesign.
The database is still the "master source" of information on the files.
Each file is sha1() hashed and stored in a filesystem hierarchy based on that hash:
/FileData/ab/cd/abcd4548293827394723984723432987.jpg
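A rough sketch of that layout in Python (the function name and the deduplication check are my own; only the two-level /FileData/ab/cd/ structure comes from the example above):

    import hashlib
    import shutil
    from pathlib import Path

    def store_blob(src: Path, root: Path = Path("/FileData")) -> Path:
        """Copy a file into a two-level directory hierarchy keyed by its sha1."""
        digest = hashlib.sha1(src.read_bytes()).hexdigest()
        dest = root / digest[:2] / digest[2:4] / (digest + src.suffix)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if not dest.exists():  # identical content is stored exactly once
            shutil.copy2(src, dest)
        return dest

Keeping the original extension on the hashed name matters for the content-type point further down.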
The database is a bit smarter about storing meta-information on each file. It would be a three-table system (a rough schema sketch follows the list):

File: stores info such as name, date, IP, owner, and a pointer to a Blob (sha1)
File_Meta: stores key/value pairs on the file, depending on the type of file; this may include information such as Image_Width, etc.
Blob: stores a reference to the sha1 along with its size.
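A rough translation of that three-table layout into DDL (SQLite syntax driven from Python; the exact column names are my own reading of the description above):

    import sqlite3

    conn = sqlite3.connect("files.db")
    conn.executescript("""
    CREATE TABLE IF NOT EXISTS Blob (
        sha1 TEXT PRIMARY KEY,   -- hash of the file content
        size INTEGER NOT NULL
    );
    CREATE TABLE IF NOT EXISTS File (
        file_id   INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        created   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        ip        TEXT,
        owner     TEXT,
        blob_sha1 TEXT NOT NULL REFERENCES Blob(sha1)
    );
    CREATE TABLE IF NOT EXISTS File_Meta (
        file_id INTEGER NOT NULL REFERENCES File(file_id),
        key     TEXT NOT NULL,   -- e.g. 'Image_Width'
        value   TEXT,
        PRIMARY KEY (file_id, key)
    );
    """)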
This system would de-duplicate the file content by storing the data under its hash, so multiple files could reference the same file data. It would also be very easy to back up or sync the file store using rsync.
Also, the limitations of a given directory containing a lot of files would be eliminated.
The file extension would be stored as part of the unique file name, appended to the hash. For example, if the hash for an empty file were abcd8765, an empty .txt file and an empty .php file would refer to the same hash. Rather than colliding, they should be stored as abcd8765.php and abcd8765.txt. Why?
Apache, etc. can be configured to automatically choose the content type and caching rules based on the file extension, so it is important to store the files with a valid name and an extension that reflects the content of the file.
This system could also really boost performance by delegating file delivery to nginx. See http://wiki.nginx.org/XSendfile.
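A sketch of how the application side could look, assuming a Flask app in front of nginx (X-Accel-Redirect is nginx's variant of X-Sendfile, and the dict here is a hypothetical stand-in for the database lookup):

    from flask import Flask, Response, abort

    app = Flask(__name__)

    # Hypothetical stand-in for the query that maps a File_ID to its
    # hash-based path and content type.
    FILES = {1: ("/FileData/ab/cd/abcd4548293827394723984723432987.jpg", "image/jpeg")}

    @app.route("/file/<int:file_id>")
    def serve_file(file_id):
        entry = FILES.get(file_id)
        if entry is None:
            abort(404)
        path, content_type = entry
        # The body stays empty: nginx intercepts X-Accel-Redirect and
        # streams the file itself from its internal /FileData/ location.
        return Response(headers={"X-Accel-Redirect": path, "Content-Type": content_type})

On the nginx side, the /FileData/ location would be marked internal so that clients can only reach the files through this header.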
I hope this helps in some way. Take care.