ML Engine Batch Prediction running on wrong python version

I have a TensorFlow model, built with Python 3.5, registered with ML Engine, and I want to run a batch prediction job using it. My API request body looks like:

{
  "versionName": "XXXXX/v8_0QSZ",
  "dataFormat": "JSON",
  "inputPaths": [
    "XXXXX"
  ],
  "outputPath": "XXXXXX",
  "region": "us-east1",
  "runtimeVersion": "1.12",
  "accelerator": {
    "count": "1",
    "type": "NVIDIA_TESLA_P100"
  }
}
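
For reference, here is a minimal sketch of submitting that request body as the predictionInput of a jobs.create call through the ML Engine REST API with the Python client library. The project ID and job ID below are placeholders, not values from my actual job:

# Minimal sketch: submit the batch prediction job via the ML Engine REST API.
# 'YOUR_PROJECT' and the jobId are hypothetical placeholders.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')

job_body = {
    'jobId': 'batch_prediction_example',       # hypothetical job ID
    'predictionInput': {
        'versionName': 'XXXXX/v8_0QSZ',
        'dataFormat': 'JSON',
        'inputPaths': ['XXXXX'],
        'outputPath': 'XXXXXX',
        'region': 'us-east1',
        'runtimeVersion': '1.12',
        'accelerator': {'count': '1', 'type': 'NVIDIA_TESLA_P100'},
    },
}

request = ml.projects().jobs().create(
    parent='projects/YOUR_PROJECT',            # hypothetical project ID
    body=job_body,
)
print(request.execute())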

The batch prediction job runs and reports "Job completed successfully." In reality, however, it was completely unsuccessful and consistently threw the following error for each input:

Exception during running the graph: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node convolution_layer/conv1d/conv1d/Conv2D (defined at /usr/local/lib/python2.7/dist-packages/google/cloud/ml/prediction/frameworks/tf_prediction_lib.py:210) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](convolution_layer/conv1d/conv1d/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer, convolution_layer/conv1d/conv1d/ExpandDims_1)]] [[{{node Cast_6/_495}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_789_Cast_6", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]] 

My questions are:

  • Why does the batch job report success when in reality it completely failed?
  • The exception above mentions Python 2.7, yet the model is registered with Python 3.5 and there is no way to specify the Python version on the batch prediction job itself (see the sketch after this list). Why is the batch prediction using 2.7?
  • What in general can I do to make this work?
  • Does this have anything to do with my accelerator option?
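
As far as I can tell, the Python version is a property of the model version (its pythonVersion field), set when the version is created, rather than something the batch prediction request accepts. A hedged sketch of creating a version with it, using placeholder project, model, and bucket names:

# Hedged sketch: pythonVersion is declared on the model version at creation
# time, not on the prediction job. All names below are placeholders.
from googleapiclient import discovery

ml = discovery.build('ml', 'v1')

version_body = {
    'name': 'v8_0QSZ',
    'deploymentUri': 'gs://YOUR_BUCKET/model_dir',   # hypothetical model path
    'runtimeVersion': '1.12',
    'framework': 'TENSORFLOW',
    'pythonVersion': '3.5',
}

request = ml.projects().models().versions().create(
    parent='projects/YOUR_PROJECT/models/YOUR_MODEL',  # hypothetical
    body=version_body,
)
print(request.execute())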

asked Nov 07 '22 by Andrew Cassidy

1 Answer

Response from a batch prediction dev: "We don't officially support Python 3 yet. However, the issue you're encountering is a known bug affecting our GPU runtimes for TF 1.11 and 1.12."
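
Not part of the dev's response, but since the bug is described as affecting the GPU runtimes for TF 1.11 and 1.12, one workaround worth trying (an assumption on my part, not a confirmed fix) is resubmitting the same job without the accelerator block so it runs on CPUs:

# Assumption, not a confirmed fix: the same predictionInput with the
# 'accelerator' block omitted, so the job runs on CPUs and avoids the
# GPU code path the dev mentions.
cpu_only_input = {
    'versionName': 'XXXXX/v8_0QSZ',
    'dataFormat': 'JSON',
    'inputPaths': ['XXXXX'],
    'outputPath': 'XXXXXX',
    'region': 'us-east1',
    'runtimeVersion': '1.12',
}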

answered Nov 15 '22 by Andrew Cassidy