Building a file upload site that scales

I'm attempting to build a file upload site as a side project, and I've never built anything that needed to handle a large number of files like this. As far as I can tell, there are three major options for storing and retrieving the files (note that there can be multiple files per upload, so, for example, website.com/a23Fc may let you download one file or several, depending on how many the user originally uploaded - similar to imgur.com):

  • Stick all the files in one huge files directory, and use a (relational) DB to figure out which files belong to which URL, then return the list of filenames for that URL. Example: user loads website.com/abcde, so the site queries the DB for all files related to the abcde upload, gets their filenames back, and outputs those. (Rough sketch just after this list.)
  • Use CouchDB, because it allows you to attach files directly to individual records in the DB, so each URL/upload could be a DB record with files attached to it. Example: user loads website.com/abcde, CouchDB grabs the document with the ID abcde, grabs the files attached to that document, and gives them to the user.
  • Skip the DB completely, and for each upload create a new directory and stick the files in that. Example: user loads website.com/abcde, the site looks for a /files/abcde/ directory, grabs all the files out of there, and gives them to the user, so no database is involved at all.
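
Here's a rough sketch of what I picture option 1 looking like, in Python (the table and column names are placeholders I made up, not anything real):

    import sqlite3

    # Option 1 sketch: one big files directory, plus a DB table mapping
    # upload IDs to filenames. "upload_files" is a made-up table name.
    def files_for_upload(upload_id):
        conn = sqlite3.connect("uploads.db")
        try:
            rows = conn.execute(
                "SELECT filename FROM upload_files WHERE upload_id = ?",
                (upload_id,),
            ).fetchall()
            return ["files/" + filename for (filename,) in rows]
        finally:
            conn.close()

    # e.g. files_for_upload("abcde") -> ["files/img1.jpg", "files/img2.jpg"]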

Which of these seems the most scalable? Like I said, I have very little experience in this area, so if I'm completely off, or if there's an obvious 4th option, I'm more than open to it. Having thousands or millions of files in a single directory (i.e., option 1) doesn't seem very smart, but having thousands or millions of directories in a directory (i.e., option 3) doesn't seem much better.

asked Oct 24 '22 by Mike Crittenden


1 Answer

A company I used to work for faced this exact problem with about a petabyte of image files. Their solution was to use the Andrew File System (see http://en.wikipedia.org/wiki/Andrew_File_System for more) to store the files in a directory structure that matched the URL structure. This scaled very well in practice.
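
As a rough illustration of the idea (the fan-out depth here is my own choice for the sketch, not their exact scheme), you can derive the directory from the upload ID itself, so that no single directory ever accumulates millions of entries:

    from pathlib import Path

    # Illustrative only: map an upload ID to a nested path so any one
    # directory stays small. The two-level fan-out is a guess, not the
    # exact layout that company used.
    def upload_dir(root, upload_id):
        # website.com/a23Fc -> /srv/files/a2/3F/a23Fc/
        return Path(root) / upload_id[:2] / upload_id[2:4] / upload_id

    print(upload_dir("/srv/files", "a23Fc"))  # /srv/files/a2/3F/a23Fc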

They also recorded the existence of the files in a database for other reasons that were internal to their application.
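
A minimal sketch of that side index, with a guessed schema (their actual reasons and design were internal to the app):

    import sqlite3

    # Sketch of the "also record existence in a DB" idea: the filesystem
    # holds the bytes, the DB just indexes which files belong to which
    # upload. Schema is a guess for illustration.
    conn = sqlite3.connect("uploads.db")
    conn.execute(
        """CREATE TABLE IF NOT EXISTS upload_files (
               upload_id TEXT NOT NULL,
               filename  TEXT NOT NULL,
               PRIMARY KEY (upload_id, filename)
           )"""
    )
    conn.execute(
        "INSERT OR IGNORE INTO upload_files VALUES (?, ?)",
        ("a23Fc", "photo1.jpg"),
    )
    conn.commit()
    conn.close()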

answered Oct 27 '22 by btilly