
Moving 1 million image files to Amazon S3

I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.

I've tried rsync, but it took over a day just to scan and build the list of image files. After another day of transferring it was only 7% complete and had slowed my server to a crawl, so I had to cancel.

Is there a better way to do this? For example, could I tar/gzip them into a single archive on another local hard drive, transfer that one file, and then unpack it on the other end?

I'm also wondering whether it makes sense to split these files across multiple subdirectories, or whether it's fine to keep all million-plus files in a single directory.

asked Jan 17 '11 by makeee

3 Answers

One option might be to perform the migration in a lazy fashion.

  • All new images go to Amazon S3.
  • Any request for an image not yet on S3 triggers a migration of that one image (queue the upload rather than doing it inline).

This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
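A minimal sketch of that lazy scheme in Python (the `LazyMigrator` class and its `upload` hook are hypothetical names, not from the answer; in a real setup the callable could wrap boto3's `upload_file`):

```python
import os
import queue

class LazyMigrator:
    """Hypothetical lazy migrator: on each image request, queue a background
    upload for any image not yet known to be on S3, so the request itself
    is never blocked by the transfer."""

    def __init__(self, upload, already_on_s3):
        self.upload = upload                # callable(local_path, key)
        self.already_on_s3 = already_on_s3  # set of keys known to be on S3
        self.pending = queue.Queue()

    def on_request(self, local_path):
        # Called from the request path: cheap check, then queue and move on.
        key = os.path.basename(local_path)
        if key not in self.already_on_s3:
            self.pending.put((local_path, key))

    def drain(self):
        # Run from a background worker whenever the server is least busy.
        while not self.pending.empty():
            local_path, key = self.pending.get()
            self.upload(local_path, key)
            self.already_on_s3.add(key)
```

The injected `upload` callable is a design choice that keeps the queueing logic separate from the transfer mechanism, so the slow background sweep mentioned above can reuse the same class.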

answered Oct 17 '22 by Ian Mercer

  1. Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.

  2. However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.

  3. Transmitting ~150 GB of data is going to consume a lot of network bandwidth for a long time. This will be just as true if you use HTTP or FTP instead of rsync. An offline transfer would be better if possible; e.g. shipping a hard disk, or a set of tapes or DVDs.

  4. Putting a million files into one flat directory is a bad idea from a performance perspective. While some file systems cope fairly well, giving O(log N) filename lookups, others degrade to O(N) per lookup; multiply that by N and accessing every file in the directory becomes O(N²). An additional problem is that utilities that need to process files in filename order may slow down significantly if they have to sort a million names. (This may partly explain why rsync took a day to build its index.)

  5. Putting all of your image files in one directory is also a bad idea from a management perspective; e.g. for doing backups, archiving, moving things around, or expanding to multiple disks or file systems.
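If you do split into subdirectories (on disk, or as S3 key prefixes), a common approach is to shard by the leading bytes of a hash of the filename. A sketch in Python, where the two-level, 256-way split is an illustrative choice rather than anything prescribed above:

```python
import hashlib
import os

def sharded_path(root, filename, levels=2):
    """Map a filename to root/ab/cd/filename using the leading hex bytes
    of its MD5 digest. Each level fans out into at most 256 directories,
    so two levels spread a million files to ~15 per leaf directory.
    The same "ab/cd/" string also works as an S3 key prefix."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    return os.path.join(root, *parts, filename)
```

Hashing the name (rather than, say, using its first letters) keeps the distribution even regardless of how the filenames are structured, and the mapping is deterministic, so no lookup table is needed to find a file later.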

answered Oct 17 '22 by Stephen C


One option, instead of transferring the files over the network, is to put them on a hard drive and ship it to Amazon's Import/Export service. That way you don't have to worry about saturating your server's network connection, etc.

answered Oct 17 '22 by GWW