
Why is gsutil rsync re-downloading all our files?

We've been using gsutil -m rsync -r to keep dev and deploy boxes in sync with a GCS bucket for nearly 2 years without any problem. There are about 85k objects in the bucket.

Until recently, this worked perfectly: we'd run a deploy-box -> GCS rsync every 15 minutes or so to keep all newly uploaded resources backed up, and then a GCS -> dev box rsync whenever we wanted to refresh the local dev data (running on OSX El Capitan).

Within the last couple of months, though, the GCS->dev rsync has started to bloat, downloading more and more images.

Initially I just thought "great, we're getting more resources uploaded", but the download has been growing far faster than the data, until today, when it seems to be downloading all 85k images.

I've double-checked that I'm in the right place and that the command and paths are correct. The gsutil output scrolls by with reams of "Copying..." and "Downloading..." messages, making good parallel use of our 100 Mbps connection. Yet when I go to another terminal and run find . -type f | wc -l on the destination directory every 10 seconds, barely 2 or 3 new files are being added per minute. When I look at the modification times of files that gsutil says it's downloading right now, the large majority are old; plenty haven't changed in a year or more. In other words, it's re-downloading all the data, using tons of time and bandwidth, for the sake of a few hundred new files.

Has something changed in recent OSX gsutil versions? Is there possibly a bug? How would I even start to go about tracking this down? Or reporting it? The newsgroups gsutil-discuss and gs-discussion have been archived, and the talk in gce-discussion is all about using gsutil from GCE instances.

Thanks!

Igor Clark asked Aug 18 '16 11:08





2 Answers

I had a similar issue where the same files were synced over and over. I don't have that many files, so you may need to check how this performs at your scale, but I decided to use the -c option to force a checksum comparison instead of mtime, because my build process modifies mtime locally. I think (and hope) the documentation is slightly wrong in stating that

compare checksums for files if the size of source and destination as well as mtime match

since it seems to use the checksum even when mtime does not match.
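
To make concrete what a checksum comparison involves: GCS reports an object's MD5 as a base64-encoded string (it appears in gsutil ls -L output for objects that have one). The following is a minimal sketch, not gsutil's own code, of computing the same value for a local file so the two can be compared by hand. The function name local_md5_b64 is hypothetical.

```python
import base64
import hashlib


def local_md5_b64(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of the file at `path`.

    Hypothetical helper for manual spot checks. The file is read in
    chunks so large files do not have to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    # GCS shows MD5 values as base64 of the raw 16-byte digest,
    # not as the more familiar hex string.
    return base64.b64encode(digest.digest()).decode("ascii")
```

If the value matches the hash gsutil reports for the object, the content is identical regardless of what the modification times say.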

alextk answered Sep 19 '22 20:09



gsutil 4.20 (released 2016-07-20) modified the change detection algorithm for rsync. Instead of comparing only the size of the local file with its cloud counterpart, it now compares both the size and the file modification time. The modification time is stored in the object's custom user metadata when the file is uploaded with rsync. If that metadata doesn't exist, the object creation time is used instead.
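
The rule described above can be modelled roughly as follows. This is a simplified sketch of the behaviour as stated in this answer, not gsutil's actual implementation; the function and parameter names are made up for illustration, and times are treated as plain integer timestamps.

```python
from typing import Optional


def needs_download(
    object_size: int,
    local_size: int,
    object_mtime: Optional[int],  # mtime from custom metadata, if present
    object_created: int,          # object creation time in the bucket
    local_mtime: int,
) -> bool:
    """Hypothetical model of the post-4.20 rule for a bucket -> local sync."""
    if object_size != local_size:
        return True
    # Objects uploaded before the mtime metadata existed have no stored
    # mtime, so the creation time stands in for it.
    effective_mtime = object_mtime if object_mtime is not None else object_created
    return effective_mtime != local_mtime


# Same size, stored mtime matches the local file: skipped.
print(needs_download(100, 100, 1_400_000_000, 1_450_000_000, 1_400_000_000))
# Same size, no stored mtime: creation time is compared and differs.
print(needs_download(100, 100, None, 1_450_000_000, 1_400_000_000))
```

Under this model, an object with no stored mtime is re-downloaded whenever its creation time differs from the local file's modification time, even though the content is unchanged, which is consistent with the symptom in the question.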

Travis Hobrla answered Sep 18 '22 20:09
