Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unable to use GPU to train a NN model in azure machine learning service using P100-NC6s-V2 compute. Fails wth CUDA error

I’ve recently started working with azure for ML and am trying to use machine learning service workspace. I’ve set up a workspace with the compute set to NC6s-V2 machines since I need train a NN using images on GPU.

The issue is that the training still happens on the CPU – the logs say it’s not able to find CUDA. Here’s the warning log when running my script. Any clues how to solve this issue?

I’ve also mentioned explicitly tensorflow-gpu package in the conda packages option of the estimator.

Here's my code for the estimator,

script_params = {
         '--input_data_folder': ds.path('dataset').as_mount(),
         '--zip_file_name': 'train.zip',
         '--run_mode': 'train'
    }


est = Estimator(source_directory='./scripts',
                     script_params=script_params,
                     compute_target=compute_target,
                     entry_script='main.py',
                     conda_packages=['scikit-image', 'keras', 'tqdm', 'pillow', 'matplotlib', 'scipy', 'tensorflow-gpu']
                     )

run = exp.submit(config=est)

run.wait_for_completion(show_output=True)

The compute target was made as per the sample code on github:

compute_name = "P100-NC6s-V2"
compute_min_nodes = 0
compute_max_nodes = 4

vm_size = "STANDARD_NC6S_V2"

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())

This is the warning with which it fails to use the GPU:

2019-08-12 14:50:16.961247: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55a7ce570830 executing computations on platform Host. Devices:
2019-08-12 14:50:16.961278: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-08-12 14:50:16.971025: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/lib:/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/mic/lib:/azureml-envs/azureml_5fdf05c5671519f307e0f43128b8610e/lib:
2019-08-12 14:50:16.971054: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-08-12 14:50:16.971081: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: 4bd815dfb0e74e3da901861a4746184f000000
2019-08-12 14:50:16.971089: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: 4bd815dfb0e74e3da901861a4746184f000000
2019-08-12 14:50:16.971164: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-08-12 14:50:16.971202: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.40.4
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
2019-08-12 14:50:16.973301: I tensorflow/core/common_runtime/direct_session.cc:296] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device

It's currently using the CPU as per the logs. Any clues how to resolve the issue here?

like image 792
SageEx Avatar asked Nov 07 '22 14:11

SageEx


1 Answers

Instead of base Estimator, you can use the Tensorflow Estimator with Keras and other libraries layered on top. That way you don't have to worry about setting up and configuring the GPU libraries, as the Tensorflow Estimator uses a Docker image with GPU libraries pre-configured.

See here for documentation:

API Reference You can use conda_packages argument to specify additional libraries. Also set argument use_gpu = True.

Example Notebook

like image 164
Roope Astala - MSFT Avatar answered Nov 15 '22 11:11

Roope Astala - MSFT