
What should I do about this gsutil "parallel composite upload" warning?

Tags: python, gsutil

I am running a Python script that uses the os module to execute a gsutil command, which I would normally run from the command prompt on Windows. I have a file on my local computer that I want to upload to a Google Cloud Storage bucket, so I do:

import os

command = 'gsutil -m cp myfile.csv  gs://my/bucket/myfile.csv'
os.system(command)

I get a message like:

==> NOTE: You are uploading one or more large file(s), which would run significantly faster if you enable parallel composite uploads. This feature can be enabled by editing the "parallel_composite_upload_threshold" value in your .boto configuration file. However, note that if you do this large files will be uploaded as 'composite objects https://cloud.google.com/storage/docs/composite-objects'_, which means that any user who downloads such objects will need to have a compiled crcmod installed (see "gsutil help crcmod"). This is because without a compiled crcmod, computing checksums on composite objects is so slow that gsutil disables downloads of composite objects.

I want to get rid of this message, either by hiding it if it's irrelevant or by actually doing what it suggests, but I can't find the .boto file. What should I do?

user1367204 asked Oct 31 '17 19:10



2 Answers

The Parallel Composite Uploads section of the documentation for gsutil describes how to resolve this (assuming, as the warning specifies, that this content will be used by clients with the crcmod module available):

gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://your-bucket

To do this safely from Python would look like:

import subprocess

filename='myfile.csv'
gs_bucket='my/bucket'
parallel_threshold='150M' # minimum size for parallel upload; 0 to disable

subprocess.check_call([
  'gsutil',
  '-o', 'GSUtil:parallel_composite_upload_threshold=%s' % (parallel_threshold,),
  'cp', filename, 'gs://%s/%s' % (gs_bucket, filename)
])

Note that here you're explicitly providing argument vector boundaries, and not relying on a shell to do this for you; this prevents a malicious or buggy filename from performing undesired operations.
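To make the point about argument vector boundaries concrete, here is a minimal sketch. It does not call gsutil at all: a child Python process stands in for it and simply prints the arguments it received. The hostile filename is a made-up example.

```python
import subprocess
import sys

# Hypothetical hostile filename: a shell would treat ';' as a command separator.
filename = "myfile.csv; echo injected"

# Stand-in for gsutil: a child Python process that prints the arguments it got.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.argv[1:])", "cp", filename],
    capture_output=True,
    text=True,
    check=True,
)

# The whole filename arrives as one argument; nothing after ';' is executed.
received = result.stdout.strip()
print(received)
```

With os.system and string concatenation, the same filename would be handed to the shell, which would split it at the semicolon and run the second part as a separate command.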


If you don't know that the clients accessing content in this bucket will have the crcmod module, consider setting parallel_threshold='0' above, which will disable this support.
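If you switch between the two thresholds often, a small helper can keep the choice in one place. This is only a sketch: build_gsutil_cp and upload are names I made up, and it assumes gsutil is on your PATH. On Windows the launcher is usually a gsutil.cmd script (an assumption about your install), so shutil.which is used to resolve the full path before running it.

```python
import shutil
import subprocess


def build_gsutil_cp(filename, gs_bucket, parallel_threshold="150M", gsutil="gsutil"):
    """Build the argument list for the copy; pass "0" to disable composite uploads."""
    return [
        gsutil,
        "-o", "GSUtil:parallel_composite_upload_threshold=%s" % parallel_threshold,
        "cp", filename, "gs://%s/%s" % (gs_bucket, filename),
    ]


def upload(filename, gs_bucket, parallel_threshold="150M"):
    """Run the copy, raising if gsutil is missing or exits with an error."""
    gsutil = shutil.which("gsutil")  # also finds gsutil.cmd on Windows via PATHEXT
    if gsutil is None:
        raise FileNotFoundError("gsutil was not found on PATH")
    subprocess.check_call(build_gsutil_cp(filename, gs_bucket, parallel_threshold, gsutil))
```

Calling upload('myfile.csv', 'my/bucket', '0') would then run the copy with parallel composite uploads disabled.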

Charles Duffy answered Nov 05 '22 20:11



Another way is to set the option that the message mentions in a configuration file on the BOTO_PATH, usually $HOME/.boto:

[GSUtil]
parallel_composite_upload_threshold = 150M
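
If you want to confirm from Python that the setting is present in the file you edited, something like the following may help. It is a read-only sketch using the standard configparser module. read_parallel_threshold is a hypothetical helper name, and the default path in the last lines is an assumption; as far as I know, running gsutil version -l lists the config path(s) gsutil actually uses.

```python
import configparser
from pathlib import Path


def read_parallel_threshold(boto_path):
    """Return the configured threshold string, or None if it is not set."""
    # interpolation=None so '%' characters elsewhere in the file are left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read(boto_path)  # silently skips files that do not exist
    return parser.get("GSUtil", "parallel_composite_upload_threshold", fallback=None)


if __name__ == "__main__":
    # Assumed default location; adjust if your config lives elsewhere.
    print(read_parallel_threshold(Path.home() / ".boto"))
```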

For maximum speed, install the compiled (C extension) version of crcmod; see "gsutil help crcmod".
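
A rough way to check whether a given Python interpreter has the compiled crcmod is sketched below. Treat it as an assumption-heavy check: it reads _usingExtension, a private attribute of crcmod that could change, and gsutil may run under a different Python than your script, so the output of gsutil version -l is the more reliable source. has_compiled_crcmod is a name I made up.

```python
def has_compiled_crcmod():
    """Return True (compiled), False (pure Python), or None if it cannot be determined."""
    try:
        import crcmod.crcmod as crcmod_impl
    except ImportError:
        return None  # crcmod is not installed for this interpreter
    # Assumption: crcmod records which implementation it loaded in this private flag.
    return getattr(crcmod_impl, "_usingExtension", None)


print(has_compiled_crcmod())
```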

fabrizioM answered Nov 05 '22 20:11
