I am trying to access a Google Cloud Storage bucket from a Hadoop cluster deployed in Google Cloud using the bdutil script. It fails if bucket access is read-only.
What I am doing:
Deploy a cluster with
bdutil deploy -e datastore_env.sh
On the master:
vgorelik@vgorelik-hadoop-m:~$ hadoop fs -ls gs://pgp-harvard-data-public 2>&1 | head -10
14/08/14 14:34:21 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.2.8-hadoop1
14/08/14 14:34:25 WARN gcsio.GoogleCloudStorage: Repairing batch of 174 missing directories.
14/08/14 14:34:26 ERROR gcsio.GoogleCloudStorage: Failed to repair some missing directories.
java.io.IOException: Multiple IOExceptions.
java.io.IOException: Multiple IOExceptions.
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createCompositeException(GoogleCloudStorageExceptions.java:61)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:361)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.createEmptyObjects(GoogleCloudStorageImpl.java:372)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.listObjectInfo(GoogleCloudStorageImpl.java:914)
at com.google.cloud.hadoop.gcsio.CacheSupplementedGoogleCloudStorage.listObjectInfo(CacheSupplementedGoogleCloudStorage.java:455)
Looking at the GCS connector's Java source code, it seems that the Google Cloud Storage Connector for Hadoop needs empty "directory" objects, which it can create on its own if the bucket is writable; otherwise it fails. Setting fs.gs.implicit.dir.repair.enable=false leads to an "Error retrieving object" error.
Is it possible to use read-only buckets as MR job input somehow?
I use gsutil for file uploads. Can it be forced to create these empty objects on file upload?
Yes, you can use a read-only Google Cloud Storage bucket as input for a Hadoop job.
For example, I have run this job many times:
./hadoop-install/bin/hadoop \
jar ./hadoop-install/contrib/streaming/hadoop-streaming-1.2.1.jar \
-input gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master* \
-mapper cgi-mapper.py -file cgi-mapper.py -numReduceTasks 0 \
-output gs://big-data-roadshow/output
This accesses the same read-only bucket you mention in your example above.
The difference between our examples is that mine ends with a glob (*), which the Google Cloud Storage Connector for Hadoop is able to expand without needing to use any of the "placeholder" directory objects.
I recommend you use gsutil to explore the read-only bucket you're interested in (since it doesn't need the "placeholder" objects), and once you have a glob expression that returns the list of objects you want processed, use that glob expression in your hadoop command.
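For example, something like this (using the same bucket and a glob similar to the one in my job above; adjust the pattern to the objects you actually want):
gsutil ls gs://pgp-harvard-data-public/hu0*/*/*/*/ASM/master*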
The answer to your second question ("Can gsutil be forced to create these empty objects on file upload?") is currently "no".