 

Efficiently storing user-uploaded images on the file system [closed]


Scenario

Users can post an item and include up to 5 images with the post. Each uploaded image needs to be resampled and resized, creating a total of 4 extra images per upload. So if a user uploads 5 images, we end up with 25 images in total to store.

Assumptions

  • The images have been properly checked and they're valid image files
  • The system has to scale (let's assume 1000 posts in the first instance, so a maximum of 5000 uploaded images - 25,000 files once the resized versions are included)
  • Each image is renamed in relation to the auto_increment id of the DB post entry and includes a relevant suffix, e.g. 12345_1_1.jpg, 12345_2_1.jpg - so there are no issues with duplicates
  • The images aren't of a sensitive nature, so there are no issues with having them directly accessible (although directory listing would be disabled)

Possible approaches

  • Given the ids are unique, we could just drop them all into one folder (inefficient after a certain point).
  • Could create a folder for each post and place all of its images into that, e.g. ROOT/images/12345 (again, we would end up with a multitude of folders).
  • Could store the images based on date, i.e. each day a new folder is created and that day's images are stored in there.
  • Could store the images based on the resized type, i.e. all the original files in one folder images/orig and all the thumbnails in images/thumb (I think Gumtree uses an approach like this).
  • Could allow X amount of files to be stored in one folder before creating another one (a minimal sketch of this bucketing idea follows the list).
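
For illustration, here is a minimal PHP sketch of that last approach. The bucket size of 1000 posts per folder, the $root path, and the function name are my assumptions, not something from the question:

    <?php
    // Hypothetical helper: bucket images into subdirectories derived from the
    // post's auto_increment id, so no single directory grows without bound.
    function imagePath(int $postId, int $imageNo, int $sizeNo,
                       string $root = '/var/www/images'): string
    {
        $bucket = sprintf('%03d', intdiv($postId, 1000)); // post 12345 -> "012"
        return sprintf('%s/%s/%d_%d_%d.jpg',
                       $root, $bucket, $postId, $imageNo, $sizeNo);
    }

    // echo imagePath(12345, 2, 1);  // /var/www/images/012/12345_2_1.jpg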

Anyone have experience on the best practices / approaches when it comes to storing images scalably?

Note: I expect someone will mention S3 - let's assume we want to keep the images locally for the time being.

Thanks for looking

asked Aug 26 '11 by Rarriety


1 Answer

We have such a system in heavy production with 30,000+ files and 20+ GB to date...

   Column    |            Type             |                        Modifiers                         
-------------+-----------------------------+----------------------------------------------------------
 File_ID     | integer                     | not null default nextval('"ACRM"."File_pseq"'::regclass)
 CreateDate  | timestamp(6) with time zone | not null default now()
 FileName    | character varying(255)      | not null default NULL::character varying
 ContentType | character varying(128)      | not null default NULL::character varying
 Size        | integer                     | not null
 Hash        | character varying(40)       | not null
Indexes:
    "File_pkey" PRIMARY KEY, btree ("File_ID")

The files are just stored in a single directory with the integer File_ID as the name of the file. We're over 30,000 files with no problems, and I've tested with higher counts, also without problems.

This is using RHEL 5 x86_64 with ext3 as the file system.
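
As a rough PHP sketch of how that scheme works (the PDO DSN, the storeFile function name, and the /FileData directory are assumptions; the table and column names follow the \d output above):

    <?php
    // Sketch: insert the metadata row first, then store the uploaded file
    // on disk under its integer File_ID. Assumes PostgreSQL via PDO.
    function storeFile(PDO $pdo, string $tmpPath, string $name, string $type): int
    {
        $stmt = $pdo->prepare(
            'INSERT INTO "ACRM"."File" ("FileName", "ContentType", "Size", "Hash")
             VALUES (?, ?, ?, ?) RETURNING "File_ID"'
        );
        $stmt->execute([$name, $type, filesize($tmpPath), sha1_file($tmpPath)]);
        $id = (int) $stmt->fetchColumn();

        move_uploaded_file($tmpPath, "/FileData/$id"); // file name is just the id
        return $id;
    }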

Would I do it this way again? No. Let me share a couple of thoughts on a redesign.

  1. The database is still the "master source" of information on the files.

  2. Each file is sha1() hashed and stored in a filesystem hierarchy based on that hash: /FileData/ab/cd/abcd4548293827394723984723432987.jpg (sketched below)

  3. The database is a bit smarter about storing meta-information on each file. It would be a three-table system:

    File : stores info such as name, date, IP, owner, and a pointer to a Blob (sha1)
    File_Meta : stores key/value pairs on the file, depending on the type of file. This may include information such as Image_Width, etc...
    Blob : stores a reference to the sha1 along with its size.

This system would de-duplicate the file content by storing the data referenced by its hash (multiple files could reference the same file data). It would also be very easy to back up and sync the file store using rsync.
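
As a minimal PHP sketch of that layout (the blobPath function name is hypothetical, and it already includes the extension handling discussed next):

    <?php
    // Sketch: content-addressed path with two levels of fan-out plus the
    // original extension, e.g. /FileData/ab/cd/abcd...jpg
    function blobPath(string $tmpPath, string $origName, string $root = '/FileData'): string
    {
        $hash = sha1_file($tmpPath);
        $ext  = strtolower(pathinfo($origName, PATHINFO_EXTENSION));
        return sprintf('%s/%s/%s/%s.%s', $root,
                       substr($hash, 0, 2), substr($hash, 2, 2), $hash, $ext);
    }

    $dest = blobPath($tmp, 'photo.jpg');
    if (!is_dir(dirname($dest))) {
        mkdir(dirname($dest), 0755, true);
    }
    if (!is_file($dest)) {          // identical content already on disk: skip the copy
        move_uploaded_file($tmp, $dest);
    }
    // Either way, the File and Blob rows in the database point at $hash.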

Also, the limitations of a given directory containing a lot of files would be eliminated.

The file extension would be stored as part of the unique file name. For example, if the hash of an empty file were abcd8765..., an empty .txt file and an empty .php file would otherwise map to the same name; instead, they should be stored as abcd8765.txt and abcd8765.php. Why?

Apache, etc. can be configured to automatically choose the content type and caching rules based on the file extension. It is important to store the files with a valid name and an extension which reflects the content of the file.

Finally, this system could really boost performance by delegating file delivery to nginx. See http://wiki.nginx.org/XSendfile.
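
In nginx's case this is done with the X-Accel-Redirect header rather than X-Sendfile proper. A hedged sketch, assuming $contentType and $subPath have been looked up from the database and an nginx location block like the one in the comment exists:

    <?php
    // PHP does the permission check / metadata lookup, then nginx serves the
    // bytes itself. Assumes an nginx config containing something like:
    //     location /FileData/ { internal; }
    header('Content-Type: ' . $contentType);           // from the File table
    header('X-Accel-Redirect: /FileData/' . $subPath); // nginx delivers the file
    exit;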

I hope this helps in some way. Take care.

answered Nov 06 '22 by gahooa