
Google cloud: Using gsutil to download data from AWS S3 to GCS

One of our collaborators has made some data available on AWS, and I was trying to get it into our Google Cloud Storage bucket using gsutil (only some of the files are of use to us, so I don't want to use the GCS web console). The collaborators have provided us with the AWS bucket name, the AWS access key ID, and the AWS secret access key.

I looked through the Google Cloud documentation and edited the ~/.boto file so that the access keys are included. I restarted my terminal and tried an 'ls', but got the following error:

gsutil ls s3://cccc-ffff-03210/
AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied

Do I need to configure/run something else too?

thanks!

EDITS:

Thanks for the replies!

I installed the Cloud SDK and I can access and run all gsutil commands on my google cloud storage project. My problem is in trying to access (e.g. 'ls' command) the amazon S3 that is being shared with me.


  1. I uncommented two lines in the ~/.boto file and put the access keys:


    # To add HMAC aws credentials for "s3://" URIs, edit and uncomment the
    # following two lines:
    aws_access_key_id = my_access_key
    aws_secret_access_key = my_secret_access_key
    

  2. Output of 'gsutil version -l':


    | => gsutil version -l
    
    my_gc_id
    gsutil version: 4.27
    checksum: 5224e55e2df3a2d37eefde57 (OK)
    boto version: 2.47.0
    python version: 2.7.10 (default, Oct 23 2015, 19:19:21) [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)]
    OS: Darwin 15.4.0
    multiprocessing available: True
    using cloud sdk: True
    pass cloud sdk credentials to gsutil: True
    config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/[email protected]/.boto
    gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
    compiled crcmod: True
    installed via package manager: False
    editable install: False
    

  3. The output with the -DD option is:


    => gsutil -DD ls s3://my_amazon_bucket_id
    
    multiprocessing available: True
    using cloud sdk: True
    pass cloud sdk credentials to gsutil: True
    config path(s): /Users/pc/.boto, /Users/pc/.config/gcloud/legacy_credentials/[email protected]/.boto
    gsutil path: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil
    compiled crcmod: True
    installed via package manager: False
    editable install: False
    Command being run: /Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gsutil -o GSUtil:default_project_id=my_gc_id -DD ls s3://my_amazon_bucket_id
    config_file_list: ['/Users/pc/.boto', '/Users/pc/.config/gcloud/legacy_credentials/[email protected]/.boto']
    config: [('debug', '0'), ('working_dir', '/mnt/pyami'), ('https_validate_certificates', 'True'), ('debug', '0'), ('working_dir', '/mnt/pyami'), ('content_language', 'en'), ('default_api_version', '2'), ('default_project_id', 'my_gc_id')]
    DEBUG 1103 08:42:34.664643 provider.py] Using access key found in shared credential file.
    DEBUG 1103 08:42:34.664919 provider.py] Using secret key found in shared credential file.
    DEBUG 1103 08:42:34.665841 connection.py] path=/
    DEBUG 1103 08:42:34.665967 connection.py] auth_path=/my_amazon_bucket_id/
    DEBUG 1103 08:42:34.666115 connection.py] path=/?delimiter=/
    DEBUG 1103 08:42:34.666200 connection.py] auth_path=/my_amazon_bucket_id/?delimiter=/
    DEBUG 1103 08:42:34.666504 connection.py] Method: GET
    DEBUG 1103 08:42:34.666589 connection.py] Path: /?delimiter=/
    DEBUG 1103 08:42:34.666668 connection.py] Data: 
    DEBUG 1103 08:42:34.666724 connection.py] Headers: {}
    DEBUG 1103 08:42:34.666776 connection.py] Host: my_amazon_bucket_id.s3.amazonaws.com
    DEBUG 1103 08:42:34.666831 connection.py] Port: 443
    DEBUG 1103 08:42:34.666882 connection.py] Params: {}
    DEBUG 1103 08:42:34.666975 connection.py] establishing HTTPS connection: host=my_amazon_bucket_id.s3.amazonaws.com, kwargs={'port': 443, 'timeout': 70}
    DEBUG 1103 08:42:34.667128 connection.py] Token: None
    DEBUG 1103 08:42:34.667476 auth.py] StringToSign:
    GET
    
    
    Fri, 03 Nov 2017 12:42:34 GMT
    /my_amazon_bucket_id/
    DEBUG 1103 08:42:34.667600 auth.py] Signature:
    AWS RN8=
    DEBUG 1103 08:42:34.667705 connection.py] Final headers: {'Date': 'Fri, 03 Nov 2017 12:42:34 GMT', 'Content-Length': '0', 'Authorization': u'AWS AK6GJQ:EFVB8F7rtGN8=', 'User-Agent': 'Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0'}
    DEBUG 1103 08:42:35.179369 https_connection.py] wrapping ssl socket; CA certificate file=/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/third_party/boto/boto/cacerts/cacerts.txt
    DEBUG 1103 08:42:35.247599 https_connection.py] validating server certificate: hostname=my_amazon_bucket_id.s3.amazonaws.com, certificate hosts=['*.s3.amazonaws.com', 's3.amazonaws.com']
    send: u'GET /?delimiter=/ HTTP/1.1\r\nHost: my_amazon_bucket_id.s3.amazonaws.com\r\nAccept-Encoding: identity\r\nDate: Fri, 03 Nov 2017 12:42:34 GMT\r\nContent-Length: 0\r\nAuthorization: AWS AN8=\r\nUser-Agent: Boto/2.47.0 Python/2.7.10 Darwin/15.4.0 gsutil/4.27 (darwin) google-cloud-sdk/164.0.0\r\n\r\n'
    reply: 'HTTP/1.1 403 Forbidden\r\n'
    header: x-amz-bucket-region: us-east-1
    header: x-amz-request-id: 60A164AAB3971508
    header: x-amz-id-2: +iPxKzrW8MiqDkWZ0E=
    header: Content-Type: application/xml
    header: Transfer-Encoding: chunked
    header: Date: Fri, 03 Nov 2017 12:42:34 GMT
    header: Server: AmazonS3
    DEBUG 1103 08:42:35.326652 connection.py] Response headers: [('date', 'Fri, 03 Nov 2017 12:42:34 GMT'), ('x-amz-id-2', '+iPxKz1dPdgDxpnWZ0E='), ('server', 'AmazonS3'), ('transfer-encoding', 'chunked'), ('x-amz-request-id', '60A164AAB3971508'), ('x-amz-bucket-region', 'us-east-1'), ('content-type', 'application/xml')]
    DEBUG 1103 08:42:35.327029 bucket.py] <?xml version="1.0" encoding="UTF-8"?>
    <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>6097164508</RequestId><HostId>+iPxKzrWWZ0E=</HostId></Error>
    DEBUG: Exception stack trace:
    Traceback (most recent call last):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/__main__.py", line 577, in _RunNamedCommandAndHandleExceptions
        collect_analytics=True)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/command_runner.py", line 317, in RunNamedCommand
        return_code = command_inst.RunCommand()
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/commands/ls.py", line 548, in RunCommand
        exp_dirs, exp_objs, exp_bytes = ls_helper.ExpandUrlAndPrint(storage_url)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 180, in ExpandUrlAndPrint
        print_initial_newline=False)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/ls_helper.py", line 252, in _RecurseExpandUrlAndPrint
        bucket_listing_fields=self.bucket_listing_fields):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 476, in IterAll
        expand_top_level_buckets=expand_top_level_buckets):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/wildcard_iterator.py", line 157, in __iter__
        fields=bucket_listing_fields):
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 413, in ListObjects
        self._TranslateExceptionAndRaise(e, bucket_name=bucket_name)
      File "/Users/pc/Documents/programs/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1471, in _TranslateExceptionAndRaise
        raise translated_exception
    AccessDeniedException: AccessDeniedException: 403 AccessDenied
    
    
    AccessDeniedException: 403 AccessDenied
    
asked Oct 30 '17 by bsmith

2 Answers

I'll assume that you are able to set up gcloud credentials using gcloud init and gcloud auth login or gcloud auth activate-service-account, and can list/write objects to GCS successfully.

From there, you need two things: a properly configured IAM policy for the AWS user whose keys you're using, and a properly configured ~/.boto file.

AWS S3 IAM policy for bucket access

A policy like this must be in effect for the user, either via a managed policy attached to the user (or to a group the user belongs to) or an inline policy attached directly to the user.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::some-s3-bucket/*",
                "arn:aws:s3:::some-s3-bucket"
            ]
        }
    ]
}

The important part is that you have ListBucket and GetObject actions, and the resource scope for these includes at least the bucket (or prefix thereof) that you wish to read from.
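If whoever owns the bucket manages access with the AWS CLI, a policy like the one above could be attached to the IAM user whose keys you were given with something along these lines. This is only a sketch of the idea (a step on the bucket owner's side, not yours), and the user name, policy name, and file name are all placeholders:

# Attach the JSON above as an inline policy on the IAM user (all names are placeholders)
aws iam put-user-policy \
    --user-name collaborator-read-user \
    --policy-name gsutil-s3-read \
    --policy-document file://s3-read-policy.json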

.boto file configuration

Interoperation between service providers is always a bit tricky. At the time of this writing, in order to support AWS Signature V4 (the only signature version supported by all AWS regions), you have to add a couple of extra properties to your ~/.boto file beyond just the credentials, in an [s3] section.

[Credentials]
aws_access_key_id = [YOUR AKID]
aws_secret_access_key = [YOUR SECRET AK]
[s3]
use-sigv4=True
host=s3.us-east-2.amazonaws.com

The use-sigv4 property cues Boto, via gsutil, to use AWS Signature V4 for requests. Currently this also requires that the host be specified in the configuration, unfortunately. The host name is easy to work out, as it follows the pattern s3.[BUCKET REGION].amazonaws.com.
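If you don't know the bucket's region, S3 itself will tell you: it returns an x-amz-bucket-region header even on denied requests (you can see us-east-1 in the debug output in the question above). A quick way to check, with the bucket name as a placeholder:

# S3 reports the bucket's region in a response header, even on a 403
curl -sI https://some-s3-bucket.s3.amazonaws.com | grep -i x-amz-bucket-region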

If you need rsync/cp to work with buckets in multiple S3 regions, you can handle it a few ways. You can set the BOTO_CONFIG environment variable before running the command to switch between multiple config files, or you can override the setting on each run with a top-level argument, like:

gsutil -o s3:host=s3.us-east-2.amazonaws.com ls s3://some-s3-bucket
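Or, with the environment-variable approach, keeping one boto file per region (the file name here is just an example):

# Point gsutil at an alternate boto config for this one invocation
BOTO_CONFIG=~/.boto-us-east-2 gsutil ls s3://some-s3-bucket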

Edit:

Just want to add... another cool way to do this job is rclone.
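For example, assuming you've already run rclone config to define an S3 remote and a GCS remote (the remote names, bucket names, and filter below are placeholders), copying just the files you need might look like:

# Copy selected objects from the shared S3 bucket into your GCS bucket
rclone copy s3remote:some-s3-bucket/path/of/interest gcsremote:my-gcs-bucket/path --include "*.csv"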

answered by Dom Zippilli


1. Generate your GCS credentials

If you download the Cloud SDK, then run gcloud init and gcloud auth login, gcloud should configure the OAuth2 credentials for the account you logged in with, allowing you to access your GCS bucket (it does this by creating a boto file that gets loaded in addition to your ~/.boto file, if it exists).
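For reference, the Cloud SDK route might look like this (the bucket name is a placeholder):

# One-time setup, then confirm that plain GCS access works before adding AWS keys
gcloud init
gcloud auth login
gsutil ls gs://my-gcs-bucket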

If you're using standalone gsutil, run gsutil config to generate a config file at ~/.boto.

2. Add your AWS credentials to the file ~/.boto

The [Credentials] section of your ~/.boto file should have these two lines populated and uncommented:

aws_access_key_id = IDHERE
aws_secret_access_key = KEYHERE

If you've done that:

  • Make sure that you didn't accidentally swap the values for key and id.
  • Verify you're loading the correct boto file(s) - you can do this by running gsutil version -l and looking for the "config path(s):" line.
  • If you still receive a 403, it's possible that they've given you either the wrong bucket name, or a key and ID corresponding to an account that doesn't have permission to list the contents of that bucket. The AWS CLI sketch below is one way to test the keys independently of gsutil.
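A quick way to separate a credentials problem from a gsutil/boto problem is to try the same keys with the AWS CLI, if you have it installed (the bucket name below is the placeholder from the question):

# Test the collaborator's keys independently of gsutil/boto
export AWS_ACCESS_KEY_ID=IDHERE
export AWS_SECRET_ACCESS_KEY=KEYHERE
aws s3 ls s3://cccc-ffff-03210/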
answered by mhouglum