 

Hadoop cannot connect to Google Cloud Storage

I'm trying to connect Hadoop running on a Google Cloud VM to Google Cloud Storage. I have:

  • Modified the core-site.xml to include the fs.gs.impl and fs.AbstractFileSystem.gs.impl properties (roughly as sketched below)
  • Downloaded the gcs-connector-latest-hadoop2.jar and referenced it in a generated hadoop-env.sh
  • Authenticated via gcloud auth login using my personal account (instead of a service account)
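
For reference, a minimal sketch of what those two changes look like on my end; the connector class names come from the standard GCS connector setup, and the jar path is just a placeholder:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>

# in hadoop-env.sh -- the jar location below is a placeholder
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/local/lib/gcs-connector-latest-hadoop2.jar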

I'm able to run gsutil ls gs://mybucket/ without any issues, but when I execute

hadoop fs -ls gs://mybucket/

I get the output:

14/09/30 23:29:31 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.9-hadoop2 

ls: Error getting access token from metadata server at: http://metadata/computeMetadata/v1/instance/service-accounts/default/token

What steps am I missing to get Hadoop to see Google Cloud Storage?

Thanks!

asked Sep 30 '14 by Denny Lee


People also ask

What is GCP in Hadoop?

GCP's Cloud Dataproc is a managed Hadoop and Spark environment. Users can run most of their existing jobs on Cloud Dataproc with minimal changes, so they don't need to abandon the Hadoop tools they already know.

Can Hdfs be on cloud?

Cloud Storage offers HDFS compatibility with equivalent (or better) performance. You can access Cloud Storage data from your existing Hadoop or Spark jobs simply by using the gs:// prefix instead of hdfs://. In most workloads, Cloud Storage actually provides the same or better performance than HDFS on Persistent Disk.
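
For example, an existing job can usually be switched over just by changing the scheme in its paths; the bucket name and paths below are placeholders:

hadoop fs -ls hdfs:///user/me/data                               # existing HDFS path
hadoop fs -ls gs://my-bucket/user/me/data                        # same listing against Cloud Storage
hadoop distcp hdfs:///user/me/data gs://my-bucket/user/me/data   # copy data between the two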

What does "does not have storage.objects.create access to the Google Cloud Storage object" (403 Forbidden) mean?

The error means that your Cloud Function's service account is lacking the storage.objects.create permission. In order to fix it, you can either give your service account a predefined role like Storage Object Creator or create a custom role with that permission.
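
As a rough sketch, the predefined role can be granted with gcloud; the project ID and service-account address below are placeholders:

gcloud projects add-iam-policy-binding my-project \
    --member="serviceAccount:my-function@my-project.iam.gserviceaccount.com" \
    --role="roles/storage.objectCreator"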


2 Answers

By default, the gcs-connector running on Google Compute Engine is optimized to use the built-in service-account mechanisms, so to force it to use the oauth2 flow instead, a few extra configuration keys need to be set. You can borrow the same "client_id" and "client_secret" that gcloud auth uses, add them to your core-site.xml, and also disable fs.gs.auth.service.account.enable:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>false</value>
</property>
<property>
  <name>fs.gs.auth.client.id</name>
  <value>32555940559.apps.googleusercontent.com</value>
</property>
<property>
  <name>fs.gs.auth.client.secret</name>
  <value>ZmssLNjJy2998hD4CTg2ejr2</value>
</property>

You can optionally also set fs.gs.auth.client.file to something other than its default of ~/.credentials/storage.json.
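
For example, to point the connector at a different credential file, a property along these lines should work (the path is just an illustration):

<property>
  <name>fs.gs.auth.client.file</name>
  <value>/home/myuser/.credentials/gcs-oauth2.json</value>  <!-- placeholder path -->
</property>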

If you do this, then when you run hadoop fs -ls gs://mybucket you'll see a new prompt, similar to the "gcloud auth login" prompt, asking you to visit a browser and enter a verification code again. Unfortunately, the connector can't quite consume a "gcloud"-generated credential directly, even though it can potentially share a credential-store file, since it asks explicitly for the GCS scopes that it needs (you'll notice that the new auth flow asks only for GCS scopes, as opposed to the big list of services requested by "gcloud auth login").

Make sure you've also set fs.gs.project.id in your core-site.xml:

<property>
  <name>fs.gs.project.id</name>
  <value>your-project-id</value>
</property>

since the GCS connector likewise doesn't automatically infer a default project from the related gcloud auth.

answered Sep 21 '22 by Dennis Huo


Thanks very much for both of your answers! They led me to the configuration noted in Migrating 50TB data from local Hadoop cluster to Google Cloud Storage.

I was able to use fs.gs.auth.service.account.keyfile by generating a new service account and then supplying the service account's email address and .p12 key.
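
For anyone following the same path, the relevant core-site.xml entries ended up looking roughly like this; fs.gs.auth.service.account.email is the companion key from the standard keyfile-based setup, and the values shown are placeholders:

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>my-service-account@my-project.iam.gserviceaccount.com</value>  <!-- placeholder -->
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/my-key.p12</value>  <!-- placeholder -->
</property>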

answered Sep 24 '22 by Denny Lee