I run an image sharing website that has over 1 million images (~150GB). I'm currently storing these on a hard drive in my dedicated server, but I'm quickly running out of space, so I'd like to move them to Amazon S3.
I've tried doing an RSYNC and it took RSYNC over a day just to scan and create the list of image files. After another day of transferring, it was only 7% complete and had slowed my server down to a crawl, so I had to cancel.
Is there a better way to do this, such as GZIP them to another local hard drive and then transfer / unzip that single file?
I'm also wondering whether it makes sense to store these files in multiple subdirectories or is it fine to have all million+ files in the same directory?
One option might be to perform the migration in a lazy fashion.
This should fairly quickly get all recent or commonly fetched images moved over to Amazon and will thus reduce the load on your server. You can then add another task that migrates the others over slowly whenever the server is least busy.
Given that the files do not exist (yet) on S3, sending them as an archive file should be quicker than using a synchronization protocol.
However, compressing the archive won't help much (if at all) for image files, assuming that the image files are already stored in a compressed format such as JPEG.
Transmitting ~150 Gbytes of data is going to consume a lot of network bandwidth for a long time. This will be the same if you try to use HTTP or FTP instead of RSYNC to do the transfer. An offline transfer would be better if possible; e.g. sending a hard disc, or a set of tapes or DVDs.
Putting a million files into one flat directory is a bad idea from a performance perspective. while some file systems would cope with this fairly well with O(logN)
filename lookup times, others do not with O(N)
filename lookup. Multiply that by N
to access all files in a directory. An additional problem is that utilities that need to access files in order of file names may slow down significantly if they need to sort a million file names. (This may partly explain why rsync
took 1 day to do the indexing.)
Putting all of your image files in one directory is a bad idea from a management perspective; e.g. for doing backups, archiving stuff, moving stuff around, expanding to multiple discs or file systems, etc.
One option you could use instead of transferring the files over the network is to put them on a harddrive and ship it to amazon's import/export service. You don't have to worry about saturating your server's network connection etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With