
Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. Access fails if the bucket is read-only.

What I am doing:

  1. Deploy a cluster with

    bdutil deploy -e datastore_env.sh
    
  2. On the master:

    vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
    14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
    14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
    14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
    java.io.IOException: Multiple IOExceptions.
    java.io.IOException: Multiple IOExceptions.
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
        at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
    

Looking at the GCS connector's Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error instead.
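For reference, one way to apply that setting for a single command is via Hadoop's generic -D option (shown only as a sketch; whether this particular property is honored may depend on the connector version):

    hadoop fs -D fs.gs.implicit.dir.repair.enable=false \
        -ls gs://pgp-harvard-data-public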

Is it possible to use read-only buckets as MR job input somehow?

I use gsutil for file uploads. Can it be forced to create these empty objects on upload?

asked Aug 14 '14 by Victor Gorelik

1 Answer

Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.

For example, I have run this job many times:

    ./hadoop-install/bin/hadoop \
      jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
      -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
      -mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
      -output gs://big-data-roadshow/output

This accesses the same read-only bucket you mention in your example above.

The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.

I recommend using gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects). Once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
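For example, a quick check like this (reusing the glob from the streaming command above, shown only as a sketch) lets you confirm the expression matches the objects you expect before handing it to hadoop:

    gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* | head -5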

The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".

answered by Paul Newson