I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>
. My understanding is that, if gsutil
is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory>
ten minutes apart and compare them with diff
, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers is supposed to work? If not, any diagnosis and fix to get the correct resume behavior? Thanks in advance!
A resumable upload allows you to resume data transfer operations to Cloud Storage after a communication failure has interrupted the flow of data. Resumable uploads work by sending multiple requests, each of which contains a portion of the object you're uploading.
gsutil is a Python application that lets you access Cloud Storage from the command line. You can use gsutil to do a wide range of bucket and object management tasks, including: Creating and deleting buckets. Uploading, downloading, and deleting objects. Listing buckets and objects.
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also according to this, when transfering files over 2MB, gsutil automatically uses a resumable transfer mode.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With