Is it possible to asynchronously transfer memory from/to GPU with cupy (or chainer)?
I'm training a relatively small network on very large data that does not fit into GPU memory. The data must be kept in CPU memory and fed to the GPU one minibatch at a time.
The memory transfer time is the dominant bottleneck of this application. I think asynchronous memory transfer would solve this problem, i.e. while one minibatch is being processed, the next minibatch is transferred to the GPU in the background.
I'm wondering whether this is possible with the cupy.cuda.Stream class, but I have no idea yet. I would appreciate any comments/advice.
EDIT: I thought the following code would perform the memory transfers asynchronously, but it does not.
```python
import numpy as np
import cupy as cp

a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# This should start before b_gpu.set() is finished.
a_gpu *= 2
```
nvvp shows that the memory transfers take place sequentially.
I found one solution by diving into the chainer source code.
The essential point seems to be allocating a pinned (page-locked) memory buffer and constructing the np.ndarray on top of it.
```python
def pinned_array(array):
    # First construct pinned (page-locked) host memory,
    # then wrap it in an ndarray and copy the data in.
    mem = cp.cuda.alloc_pinned_memory(array.nbytes)
    src = np.frombuffer(
        mem, array.dtype, array.size).reshape(array.shape)
    src[...] = array
    return src

a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)

# np.ndarray backed by pinned memory
a_cpu = pinned_array(a_cpu)
b_cpu = pinned_array(b_cpu)

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# Wait until a_cpu has been copied to a_gpu.
a_stream.synchronize()
# This computation runs in parallel with b_gpu.set().
a_gpu *= 2
```