Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I use Nvidia Multi-process Service (MPS) to run multiple non-MPI CUDA applications?

Can I run non-MPI CUDA applications concurrently on NVIDIA Kepler GPUs with MPS? I'd like to do this because my applications cannot fully utilize the GPU, so I want them to co-run together. Is there any code example to do this?

like image 530
dalibocai Avatar asked Jan 10 '16 19:01

dalibocai


People also ask

How do I use Nvidia CUDA MPS control?

Starting MPS control daemonexport CUDA_VISIBLE_DEVICES=0 # Select GPU 0. nvidia-cuda-mps-control -d # Start the daemon. This will start the MPS control daemon that will spawn a new MPS Server instance for that $UID starting an application and associate it with GPU visible to the control daemon.

What is Nvidia CUDA MPS control?

DESCRIPTION. MPS is a runtime service designed to let multiple MPI processes using CUDA to run concurrently on a single GPU in a way that's transparent to the MPI program. A CUDA program runs in MPS mode if the MPS control daemon is running on the system.

Can a GPU run multiple processes?

CUDA MPS is a feature that allows multiple CUDA processes to share a single GPU context. each process receive some subset of the available connections to that GPU. MPS allows overlapping of kernel and memcopy operations from different processes on the GPU to achieve maximum utilization.

Can MPI be used for GPU?

Regular MPI implementations pass pointers to host memory, staging GPU buffers through host memory using cudaMemcopy. With Kepler class and later GPUs & Hyper-Q, multiple MPI processes can share the GPU.


1 Answers

The necessary instructions are contained in the documentation for the MPS service. You'll note that those instructions don't really depend on or call out MPI, so there really isn't anything MPI-specific about them.

Here's a walkthrough/example.

  1. Read section 2.3 of the above-linked documentation for various requirements and restrictions. I recommend using CUDA 7, 7.5, or later for this. There were some configuration differences with prior versions of CUDA MPS that I won't cover here. Also, I'll demonstrate just using a single server/single GPU. The machine I am using for test is a CentOS 6.2 node using a K40c (cc3.5/Kepler) GPU, with CUDA 7.0. There are other GPUs in the node. In my case, the CUDA enumeration order places my K40c at device 0, but the nvidia-smi enumeration order happens to place it as id 2 in the order. All of these details matter in a system with multiple GPUs, impacting the scripts given below.

  2. I'll create several helper bash scripts and also a test application. For the test application, we'd like something with kernel(s) that can obviously run concurrently with kernels from other instances of the application, and we'd also like something that makes it obvious when those kernels (from separate apps/processes) are running concurrently or not. To meet these needs for demonstration purposes, let's have an app that has a kernel that just runs in a single thread on a single SM, and simply waits for a period of time (we'll use ~5 seconds) before exiting and printing a message. Here's a test app that does that:

    $ cat t1034.cu
    #include <stdio.h>
    #include <stdlib.h>
    
    #define MAX_DELAY 30
    
    #define cudaCheckErrors(msg) \
      do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
      } while (0)
    
    
    #include <time.h>
    #include <sys/time.h>
    #define USECPSEC 1000000ULL
    
    unsigned long long dtime_usec(unsigned long long start){
    
      timeval tv;
      gettimeofday(&tv, 0);
      return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
    }
    
    #define APPRX_CLKS_PER_SEC 1000000000ULL
    __global__ void delay_kernel(unsigned seconds){
    
      unsigned long long dt = clock64();
      while (clock64() < (dt + (seconds*APPRX_CLKS_PER_SEC)));
    }
    
    int main(int argc, char *argv[]){
    
      unsigned delay_t = 5; // seconds, approximately
      unsigned delay_t_r;
      if (argc > 1) delay_t_r = atoi(argv[1]);
      if ((delay_t_r > 0) && (delay_t_r < MAX_DELAY)) delay_t = delay_t_r;
      unsigned long long difft = dtime_usec(0);
      delay_kernel<<<1,1>>>(delay_t);
      cudaDeviceSynchronize();
      cudaCheckErrors("kernel fail");
      difft = dtime_usec(difft);
      printf("kernel duration: %fs\n", difft/(float)USECPSEC);
      return 0;
    }
    
    
    $ nvcc -arch=sm_35 -o t1034 t1034.cu
    $ ./t1034
    kernel duration: 6.528574s
    $
    
  3. We'll use a bash script to start the MPS server:

    $ cat start_as_root.bash
    #!/bin/bash
    # the following must be performed with root privilege
    export CUDA_VISIBLE_DEVICES="0"
    nvidia-smi -i 2 -c EXCLUSIVE_PROCESS
    nvidia-cuda-mps-control -d
    $
    
  4. And a bash script to launch 2 copies of our test app "simultaneously":

    $ cat mps_run
    #!/bin/bash
    ./t1034 &
    ./t1034
    $
    
  5. We could also have a bash script to shut down the server, although it's not needed for this walkthrough:

    $ cat stop_as_root.bash
    #!/bin/bash
    echo quit | nvidia-cuda-mps-control
    nvidia-smi -i 2 -c DEFAULT
    $
    
  6. Now when we just launch our test app using the mps_run script above, but without actually enabling the MPS server, we get the expected behavior that one instance of the app takes the expected ~5 seconds, whereas the other instance takes approximately double that (~10 seconds) because, since it does not run concurrently with an app from another process, it waits for 5 seconds while the other app/kernel is running, and then spends 5 seconds running its own kernel, for a total of ~10 seconds:

    $ ./mps_run
    kernel duration: 6.409399s
    kernel duration: 12.078304s
    $
    
  7. On the other hand, if we start the MPS server first, and repeat the test:

    $ su
    Password:
    # ./start_as_root.bash
    Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:82:00.0.
    All done.
    # exit
    exit
    $ ./mps_run
    kernel duration: 6.167079s
    kernel duration: 6.263062s
    $
    

    we see that both apps take the same amount of time to run, because the kernels are running concurrently, due to MPS.

  8. You're welcome to experiment as you see fit. If this sequence appears to work correctly for you, but running your own application doesn't seem to give the expected results, one possible reason may be that your app/kernels are not able to run concurrently with other instances of the app/kernels due to the construction of your kernels, not anything to do with MPS. You might want to verify the requirements for concurrent kernels, and/or study the concurrentKernels sample app.

  9. Much of the information here was recycled from the test/work done here albeit the presentation here with separate apps is different than the MPI case presented there.

UPDATE: The scheduler behavior in the non-MPS case when running kernels from multiple processes appears to have changed with Pascal and newer GPUs. The above test results still are correct for the GPUs tested on (e.g. Kepler), but when running the above test case on a Pascal or newer GPU, different results will be observed in the non-MPS case. The scheduler is described as a "time-sliced" scheduler in the latest MPS doc and what appears to be happening is that rather than wait for a kernel from one process to complete, the scheduler may, according to some unpublished rules, choose to pre-empt a running kernel so that it can switch to another kernel from another process. This still doesn't mean that kernels from separate processes are running "concurrently" in the traditional usage of that word in CUDA documentation, but the above code is "tricked" by the time-sliced scheduler (on Pascal and newer) because it depends on using the SM clock to set kernel duration. The combination of the time-sliced scheduler plus this usage of the SM clock makes this test case appear to run "concurrently". However, as described in the MPS doc, the code from kernel A is not executing in the same clock cycle(s) as the code from kernel B, when A and B originate from separate processes in the non-MPS case.

An alternative method to demonstrate this using the above general approach might be to use a kernel duration that is set by a number of loops, rather than a kernel duration that is set by reading the SM clock, as described here. Care must be taken in that case to avoid having the loops "optimized out" by the compiler.

like image 145
Robert Crovella Avatar answered Oct 20 '22 15:10

Robert Crovella