
On resume gsutil seems to re-upload files

I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.

It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)

For example:

<  404104811  2014-04-08T14:13:44Z  gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
>  404104811  2014-04-08T14:43:48Z  gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
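The comparison described above can be automated. The following is a minimal sketch, not part of the original question: it assumes each listing was saved to a file, that object lines have the three-column form shown in the diff (size, timestamp, URL), and that URLs contain no whitespace. The function name `reuploaded` and the file names are hypothetical.

```shell
#!/bin/sh
# Report objects whose size is unchanged but whose timestamp differs
# between two saved `gsutil ls -Rl` listings.
#
# Hypothetical usage:
#   gsutil ls -Rl gs://my-bucket/disk-top-directory > before.txt
#   (wait ten minutes)
#   gsutil ls -Rl gs://my-bucket/disk-top-directory > after.txt
#   reuploaded before.txt after.txt
reuploaded() {
    awk '
        # Only look at object lines: numeric size, timestamp, gs:// URL.
        # Directory headers and the TOTAL line are ignored.
        NF == 3 && $1 ~ /^[0-9]+$/ && $3 ~ /^gs:\/\// {
            if (FNR == NR) {
                # First listing: remember size and timestamp per URL.
                size[$3] = $1
                stamp[$3] = $2
            } else if (($3 in size) && size[$3] == $1 && stamp[$3] != $2) {
                # Second listing: same size, newer or different timestamp.
                print $3, stamp[$3], "->", $2
            }
        }
    ' "$1" "$2"
}
```

Each output line names an object that kept its size but changed its timestamp, which is the pattern that suggests a re-upload.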

The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.

Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, can anyone suggest a diagnosis and a fix to get the correct resume behavior? Thanks in advance!

asked Apr 08 '14 15:04 by MPBall



1 Answer

You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.

gsutil cp -Rn <disk-top-directory> <bucket>

From the help (gsutil help cp)

-n            No-clobber. When specified, existing files or objects at the
              destination will not be overwritten. Any items that are skipped
              by this option will be reported as being skipped. This option
              will perform an additional HEAD request to check if an item
              exists before attempting to upload the data. This will save
              retransmitting data, but the additional HTTP requests may make
              small object transfers slower and more expensive.
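If the transfer keeps getting interrupted, the same command can simply be re-run, and -n will skip whatever is already in the bucket. Here is a rough sketch of a retry wrapper, assuming gsutil exits with a non-zero status when a copy fails. The function name `upload_no_clobber` and the variables `GSUTIL`, `MAX_ATTEMPTS` and `RETRY_DELAY` are made up for this example and are not gsutil settings.

```shell
#!/bin/sh
# Re-run a recursive no-clobber copy until it succeeds or attempts run out.
# GSUTIL, MAX_ATTEMPTS and RETRY_DELAY are hypothetical knobs for this sketch.
upload_no_clobber() {
    src=$1
    dst=$2
    attempt=1
    while [ "$attempt" -le "${MAX_ATTEMPTS:-5}" ]; do
        # -R recurses into the directory; -n skips objects that already exist.
        if "${GSUTIL:-gsutil}" cp -R -n "$src" "$dst"; then
            return 0
        fi
        echo "attempt $attempt failed; retrying" >&2
        attempt=$((attempt + 1))
        sleep "${RETRY_DELAY:-30}"
    done
    return 1
}
```

It would be called as `upload_no_clobber <disk-top-directory> <bucket>`. Keep in mind that -n only checks whether an object exists at the destination, so a local file that changed after it was uploaded will not be sent again.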

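Also, when transferring files larger than 2MB, gsutil automatically uses a resumable transfer mode. That mode resumes an interrupted upload of a single file; it does not skip files that have already been uploaded in full, which is what -n is for.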

answered Oct 02 '22 11:10 by IanGSY