 

Hadoop cannot connect to Google Cloud Storage

I'm trying to connect Hadoop running on a Google Cloud VM to Google Cloud Storage. I have:

  • Modified the core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties (roughly as sketched below)
  • Downloaded the gcs-connector-latest-hadoop2.jar and referenced it in a generated hadoop-env.sh
  • Authenticated via gcloud auth login using my personal account (instead of a service account)
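
For reference, a minimal sketch of what those two changes look like on my end; the connector class names come from the standard GCS connector setup, and the jar path is just a placeholder:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>

# in hadoop-env.sh -- the jar location below is a placeholder
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/lib/gcs-connector-latest-hadoop2.jar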

I'm able to run gsutil ls gs://mybucket/ without any issues, but when I execute

hadoop fs -ls gs://mybucket/

I get the output:

14/09/30 23:29:31 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2 

ls: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token

What steps am I missing to get Hadoop to see Google Cloud Storage?

Thanks!

asked Sep 30 '14 by Denny Lee


People also ask

What is GCP in Hadoop?

GCP's Cloud Dataproc is a managed Hadoop and Spark environment. Users can run most of their existing jobs on Cloud Dataproc with minimal changes, so they don't need to abandon the Hadoop tools they already know.

Can Hdfs be on cloud?

Cloud Storage offers HDFS compatibility with equivalent (or better) performance. You can access Cloud Storage data from your existing Hadoop or Spark jobs simply by using the gs:// prefix instead of hdfs://. In most workloads, Cloud Storage actually provides the same or better performance than HDFS on Persistent Disk.
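
For example, an existing job can usually be switched over just by changing the scheme in its paths; the bucket name and paths below are placeholders:

hadoop fs -ls hdfs:///user/me/data                               # existing HDFS path
hadoop fs -ls gs://my-bucket/user/me/data                        # same listing against Cloud Storage
hadoop distcp hdfs:///user/me/data gs://my-bucket/user/me/data   # copy data between the two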

What does "does not have storage.objects.create access to the Google Cloud Storage object" (403 Forbidden) mean?

The error means that your Cloud Function's service account is lacking the storage.objects.create permission. In order to fix it, you can either give your service account a predefined role like Storage Object Creator or create a custom role with that permission.
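
As a rough sketch, the predefined role can be granted with gcloud; the project ID and service-account address below are placeholders:

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-function@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectCreator"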


2 Answers

By default, the gcs-connector running on Google Compute Engine is optimized to use the built-in service-account mechanisms, so to force it to use the oauth2 flow instead, a few extra configuration keys need to be set. You can borrow the same "client_id" and "client_secret" that gcloud auth uses, add them to your core-site.xml, and also disable fs.gs.auth.service.account.enable:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>false</value>
</property>
<property>
  <name>fs.gs.auth.client.id</name>
  <value>32555940559.apps.googleusercontent.com</value>
</property>
<property>
  <name>fs.gs.auth.client.secret</name>
  <value>ZmssLNjJy2998hD4CTg2ejr2</value>
</property>

You can optionally also set fs.gs.auth.client.file to something other than its default of ~/.credentials/storage.json.
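
For example, to point the connector at a different credential file, a property along these lines should work (the path is just an illustration):

<property>
  <name>fs.gs.auth.client.file</name>
  <value>/home/myuser/.credentials/gcs-oauth2.json</value>  <!-- placeholder path -->
</property>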

If you do this, then when you run hadoop fs -ls gs://mybucket you'll see a new prompt, similar to the "gcloud auth login" prompt, asking you to visit a browser and enter a verification code again. Unfortunately, the connector can't quite consume a "gcloud"-generated credential directly, even though it can potentially share a credential-store file, since it asks explicitly for the GCS scopes that it needs (you'll notice that the new auth flow asks only for GCS scopes, as opposed to the big list of services requested by "gcloud auth login").

Make sure you've also set fs.gs.project.id in your core-site.xml:

<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
</property>

since the GCS connector likewise doesn't automatically infer a default project from the related gcloud auth.

answered Sep 21 '22 by Dennis Huo


Thanks very much for both of your answers! They led me to the configuration noted in Migrating 50TB data from local Hadoop cluster to Google Cloud Storage.

I was able to use fs.gs.auth.service.account.keyfile by generating a new service account and then supplying the service account's email address and .p12 key.
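
For anyone following the same path, the relevant core-site.xml entries ended up looking roughly like this; fs.gs.auth.service.account.email is the companion key from the standard keyfile-based setup, and the values shown are placeholders:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>my-service-account@my-project.iam.gserviceaccount.com</value>  <!-- placeholder -->
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/my-key.p12</value>  <!-- placeholder -->
</property>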

answered Sep 24 '22 by Denny Lee