
Accessing read-only Google Storage buckets from Hadoop

I am trying to access a Google Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. Access fails if the bucket is read-only.

What I am doing:

  1. Deploy a cluster with

    bdutil deploy -e datastore_env.sh
    
  2. On the master:

    vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
    14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
    14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
    14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
    java.io.IOException: Multiple IOExceptions.
    java.io.IOException: Multiple IOExceptions.
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
        at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
        at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
    

Looking at the GCS connector's Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error instead.
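For reference, one way to apply that setting for a single command is via Hadoop's generic -D option (shown only as a sketch; whether this particular property is honored may depend on the connector version):

    hadoop fs -D fs.gs.implicit.dir.repair.enable=false \
        -ls gs://pgp-harvard-data-public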

Is it possible to use read-only buckets as MR job input somehow?

I use gsutil for file uploads. Can it be forced to create these empty objects on upload?

asked Aug 14 '14 by Victor Gorelik

1 Answer

Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.

For example, I have run this job many times:

    ./hadoop-install/bin/hadoop \
      jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
      -input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
      -mapper cgi-mapper.py -file cgi-mapper.py --numReduceTasks 0 \
      -output gs://big-data-roadshow/output

This accesses the same read-only bucket you mention in your example above.

The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.

I recommend using gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects). Once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
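For example, a quick check like this (reusing the glob from the streaming command above, shown only as a sketch) lets you confirm the expression matches the objects you expect before handing it to hadoop:

    gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* | head -5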

The answer to your second question ("Can gsutil be forced to create these empty objects on file upload") is currently "no".

answered by Paul Newson