 

How can I efficiently move many files to a new server?

Tags:

linux

zip

gzip

tar

I'm switching hosting providers and need to transfer millions of uploaded files to a new server. All of the files are in the same directory. Yes. You read that correctly. ;)

In the past I've done this:

  1. Zip all of the files from the source server
  2. scp the zip to the new server
  3. Unzip
  4. Move directory to appropriate location
    • for whatever reason, my zips from step 1 always include the full path, so I have to mv the files into place afterwards.

The last time I did this it took about 4-5 days to complete and that was about 60% of what I have now.

I'm hoping for a better way. What do you suggest?

File structure is hashed. Something like this: AAAAAAAAAA.jpg - ZZZZZZZZZZ.txt

Here's one idea we're tossing around:

Split the zips into tons of mini-zips based on 3 letter prefixes. Something like:

AAAAAAAAAA.jpg - AAAZZZZZZZ.gif => AAA.zip

Theoretical Pros:

  • could speed up transfer, allowing multiple zips to transfer at once
  • could limit time lost to failed transfer. (waiting 2 days for a transfer to ultimately fail is awful)

Theoretical Cons:

  • could slow down the initial zip considerably, since zip has to look the files up through a wildcard (AAA*); perhaps offset by running many zip processes at once, using all CPUs instead of only one (see the sketch after these lists)
  • Complexity?
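
Here's a rough sketch of that prefix-split idea, assuming bash and GNU xargs; /path/to/uploads and /tmp/archives are placeholder paths:

# build the list of 3-letter prefixes that actually exist, then create one
# archive per prefix, running 4 zip processes at a time (roughly one per core);
# zip -0 stores without compression, since .jpg files won't shrink further
mkdir -p /tmp/archives
cd /path/to/uploads
ls | cut -c1-3 | sort -u | \
  xargs -P 4 -I{} sh -c 'zip -q -0 /tmp/archives/{}.zip {}*'

Each resulting AAA.zip could then be scp'd independently (and several at a time), so a failed transfer only costs one small archive instead of the whole set.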

We've also thought about plain rsync and scp, but worry about the overhead of transferring each file individually. And since the remote server is empty, I don't need to worry about what's already there.

What do you think? How would you do it?

(Yes, I'll be moving these to Amazon S3 eventually, and I'll just ship them a disk, but in the meantime, I need them up yesterday!)

asked Nov 04 '12 by Ryan

2 Answers

You actually have multiple options; my favorite would be rsync.

rsync [dir1] [dir2]

This command will actually compare the directories, and sync only the differences between them.

With this, I would most likely use the following:

rsync -z -e ssh user@oldserver:/var/www/ /var/www/

-z compress file data during transfer
-e specify the remote shell to use (here, ssh)
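
A fuller invocation along those lines (a sketch; the user, host, and paths are placeholders):

# -a preserves permissions and timestamps; --partial keeps partially transferred
# files so an interrupted run can resume instead of starting over
rsync -az --partial --progress -e ssh user@oldserver:/var/www/uploads/ /var/www/uploads/

Note that -z buys little here, since .jpg files are already compressed.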

You could also use SFTP (file transfer over SSH).

Or even wget.

wget -r -c ftp://user@oldserver/var/www/

(Note that wget speaks HTTP and FTP, not SSH, so the files would need to be reachable over one of those protocols.)
answered by Matt Clark


I'm from the Linux/Unix world. I'd use tar to make a number of tar files each of a set size. E.g.:

tar -cML $MAXIMUM_FILE_SIZE_IN_KILOBYTES --file=${FILENAME}_{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}.tar ${THE_FILES}
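
A concrete instance of that command (a sketch, assuming GNU tar; the 1 GB volume size and the names are placeholders):

# -c create, -M multi-volume (move on to the next --file when one fills up),
# -L maximum volume size in kilobytes (1048576 KB is roughly 1 GB per volume)
tar -cML 1048576 \
    --file=uploads_{0,1,2,3,4,5,6,7,8,9}{0,1,2,3,4,5,6,7,8,9}.tar \
    /var/www/uploads

The shell's brace expansion generates the hundred --file arguments up front, and each volume can then be transferred (and retried) on its own.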

I'd skip recompression unless your .txt files are huge. You won't get much mileage out of recompressing .jpeg files, and it will eat up a lot of CPU (and real) time.

I'd look into how your traffic shaping works. How many concurrent connections can you have? How much bandwidth per connection? How much total?
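
If the host caps bandwidth per connection but allows several connections, one workaround (a sketch, assuming rsync; the host, paths, and prefixes are placeholders) is to run a few transfers side by side, optionally capping each one:

# three parallel rsyncs by filename prefix, each limited to roughly 10 MB/s
# (--bwlimit takes KB per second); extend the pattern to cover all prefixes
rsync -a --bwlimit=10000 -e ssh user@oldserver:/var/www/uploads/A* /var/www/uploads/ &
rsync -a --bwlimit=10000 -e ssh user@oldserver:/var/www/uploads/B* /var/www/uploads/ &
rsync -a --bwlimit=10000 -e ssh user@oldserver:/var/www/uploads/C* /var/www/uploads/ &
wait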

I've seen some interesting things with scp. Testing on a home network, scp gave much lower throughput than copying over a mounted smbfs share. I'm not entirely clear why. Though that may be desirable if scp is verifying the copy and requesting retransmission on errors. (There is a very small probability of an error slipping through in a packet transmitted over the internet. Without some sort of subsequent verification step, that becomes a real problem with large data sets. You might want to run md5 hashes...)
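
A sketch of that verification step, assuming the same relative layout on both servers (the paths are placeholders):

# on the source server: build a checksum manifest
cd /var/www/uploads && find . -type f -print0 | xargs -0 md5sum > /tmp/uploads.md5

# copy the manifest across, then on the destination server list anything that doesn't match
cd /var/www/uploads && md5sum -c /tmp/uploads.md5 | grep -v ': OK$'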

If this is a webserver, you could always just use wget. Though that seems highly inefficient...

answered by TooLazyToLogIn