I am using Google Cloud to train a neural network on the cloud like in the following example:
https://cloud.google.com/blog/big-data/2016/12/how-to-classify-images-with-tensorflow-using-google-cloud-machine-learning-and-cloud-dataflow
To start, I set the following environment variables:
PROJECT_ID=$(gcloud config list project --format "value(core.project)")
BUCKET_NAME=${PROJECT_ID}-mlengine
I then uploaded my training and evaluation data, both CSVs named eval_set.csv and train_set.csv, to Google Cloud Storage with the following command:
gsutil cp -r data gs://$BUCKET_NAME
I then verified that these two CSV files were in the polar-terminal-160506-mlengine/data directory of my Google Cloud Storage.
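(For reference, the upload can also be confirmed from the command line rather than the console; this is a quick check reusing the $BUCKET_NAME variable set above:)
# List the uploaded files with sizes; both CSVs should appear here.
gsutil ls -l gs://$BUCKET_NAME/data/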
I then made the following environment variable assignments:
# Assign appropriate values.
PROJECT=$(gcloud config list project --format "value(core.project)")
JOB_ID="flowers_${USER}_$(date +%Y%m%d_%H%M%S)"
GCS_PATH="${BUCKET}/${USER}/${JOB_ID}"
DICT_FILE=gs://cloud-ml-data/img/flower_photos/dict.txt
Before trying to preprocess my evaluation data like so:
# Preprocess the eval set.
python trainer/preprocess.py \
--input_dict "$DICT_FILE" \
--input_path "gs://cloud-ml-data/img/flower_photos/eval_set.csv" \
--output_path "${GCS_PATH}/preproc/eval" \
--cloud
Sadly, this runs for a bit and then crashes, outputting the following error:
ValueError: Unable to get the Filesystem for path gs://polar-terminal-160506-mlengine/data/eval_set.csv
This doesn't seem possible, as I have confirmed via the Google Cloud Storage console that eval_set.csv is stored at this location. Is this perhaps a permissions issue, or something I am not seeing?
Edit:
I have found the cause of this runtime error to be a certain line in the trainer/preprocess.py file, namely this one:
read_input_source = beam.io.ReadFromText(
opt.input_path, strip_trailing_newlines=True)
This seems like a pretty good clue, but I am still not really sure what is going on. When I google "beam.io.ReadFromText ValueError: Unable to get the Filesystem for path", nothing relevant appears, which is a bit odd. Thoughts?
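One sanity check that might narrow this down (an assumption on my part: this ValueError message comes from Beam's FileSystems lookup, which resolves a path's scheme like gs:// to a registered filesystem):
# Reproduce the lookup Beam performs for gs:// paths; if no GCS
# filesystem is registered, this raises the same ValueError.
python -c "from apache_beam.io.filesystems import FileSystems; FileSystems.get_filesystem('gs://any-bucket/any-file.csv')"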
Try pip install apache_beam[gcp]. This will help you.
It looks like your apache-beam library installation might be incomplete. Try pip install apache-beam[gcp]. This allows Apache Beam to access files stored on Google Cloud Storage. (The Apache Beam package is available on PyPI as apache-beam.)
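(A quick way to confirm the extras took effect, assuming the GCS I/O module is what the [gcp] extra makes importable:)
# If the [gcp] extra is missing, this import fails with an ImportError.
python -c "import apache_beam.io.gcp.gcsio; print('GCS I/O available')"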
Just as Jean-Christophe described, I believe your installation is incomplete. The base apache-beam package doesn't include the components needed to read from and write to GCP. To get those, as well as the runner for deploying your pipeline to Cloud Dataflow (the DataflowRunner), you'll need to install it via pip:
pip install google-cloud-dataflow
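(To confirm the runner is available after installing, a quick sanity check; the module path below assumes Beam 2.x's layout:)
# Fails with an ImportError if the Dataflow runner isn't installed.
python -c "from apache_beam.runners.dataflow.dataflow_runner import DataflowRunner; print('DataflowRunner available')"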
This is how I was able to resolve the same issue.