 

cudaMemcpy too slow

Tags: cuda, bus

I use cudaMemcpy() once to copy exactly 1 GB of data to the device. This takes 5.9 s. The other way round it takes 5.1 s. Is this normal?
Does the function itself have that much overhead before copying? Theoretically there should be a throughput of at least 4 GB/s on the PCIe bus.
There are no overlapping memory transfers because the Tesla C870 simply does not support them. Any hints?

EDIT 2: my test program + updated timings; I hope it is not too much to read!
The cutCreateTimer() functions won't compile for me: 'error: identifier "cutCreateTimer" is undefined' - this could be related to the old CUDA version (2.0) installed on the machine.

    #include <stdio.h>
    #include <sys/time.h>

    __host__ void time_int(int print){
        static struct timeval t1;   /* previous time stamp */
        static struct timeval t2;   /* current time stamp */
        double time;
        if(gettimeofday(&t2, 0) == -1) return;
        if(print != 0){
            time = (double) (t2.tv_sec - t1.tv_sec)
                 + ((double) (t2.tv_usec - t1.tv_usec)) / 1000000.0;
            printf("%.3f s\n", time);   /* format assumed from the displayed timings */
        }
        t1 = t2;
    }

In main() (the original snippet called time(), but the function defined above is time_int()):

    int main(void){
        time_int(0);                             /* initialize the timer */
        void *x;
        cudaMallocHost(&x, 1073741824);          /* 1 GB of pinned host memory */
        void *y;
        cudaMalloc(&y, 1073741824);              /* 1 GB of device memory */
        time_int(1);                             /* allocation time */
        cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
        time_int(1);                             /* host-to-device copy time */
        cudaMemcpy(x, y, 1073741824, cudaMemcpyDeviceToHost);
        time_int(1);                             /* device-to-host copy time */
        return 0;
    }

Displayed timings are:
0.86 s allocation
0.197 s first copy
5.02 s second copy
The weird thing is: although it displays 0.197 s for the first copy, it takes much longer when I watch the program run.
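Side note: cutCreateTimer() comes from the cutil helper library shipped with the SDK samples, not from the CUDA runtime itself. A possible alternative is timing with CUDA runtime events; a minimal sketch, reusing the x and y buffers from the listing above:

    cudaEvent_t start, stop;
    float ms;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(y, x, 1073741824, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);              /* wait until the copy has finished */
    cudaEventElapsedTime(&ms, start, stop);  /* elapsed time in milliseconds */
    printf("copy took %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);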

Callahan asked Sep 15 '11


2 Answers

Yes, this is normal. cudaMemcpy() does a lot of checking and work (if the host memory was allocated with an ordinary malloc() or mmap()). It has to check that every page of the data is resident in memory and move the pages (one by one) to the driver.

You can use the cudaHostAlloc() function or cudaMallocHost() to allocate memory instead of malloc(). They allocate pinned memory, which is always resident in RAM and can be accessed directly by the GPU's DMA engine (making cudaMemcpy() faster). Citing from the cudaHostAlloc documentation:

Allocates count bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy().

The only limiting factor is that the total amount of pinned memory in a system is limited (it cannot exceed the RAM size; it is better to use no more than RAM minus 1 GB):

Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.
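To illustrate the difference, a minimal sketch comparing a copy from an ordinary malloc() buffer with one from a pinned buffer (error checking omitted; the 256 MB size is just an example):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void){
        const size_t N = 256 * 1024 * 1024;   /* 256 MB test buffer */
        void *pageable = malloc(N);           /* ordinary pageable host memory */
        void *pinned;
        cudaMallocHost(&pinned, N);           /* page-locked (pinned) host memory */
        void *dev;
        cudaMalloc(&dev, N);

        /* pageable source: the driver must lock and copy pages one by one */
        cudaMemcpy(dev, pageable, N, cudaMemcpyHostToDevice);

        /* pinned source: the GPU's DMA engine can read the buffer directly */
        cudaMemcpy(dev, pinned, N, cudaMemcpyHostToDevice);

        cudaFree(dev);
        cudaFreeHost(pinned);                 /* pinned memory has its own free call */
        free(pageable);
        return 0;
    }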

osgx answered Sep 19 '22


Assuming the transfers are timed accurately, 1.1 seconds for a transfer of 1 GB from pinned memory seems slow. Are you sure the PCIe slot is configured to the correct width? For full performance, you'd want a x16 configuration. Some platforms provide two slots, one of which is configured as a x16 and the other as a x4. So if your machine has two slots, you might want to try moving the card into the other slot. Other systems have two slots where you get x16 if only one slot is occupied, but two slots of x8 if both are occupied. The BIOS setup may help in figuring out how the PCIe slots are configured.

The Tesla C870 is rather old technology, but if I recall correctly, transfer rates of around 2 GB/s from pinned memory should be possible with these parts, which used a first-generation PCIe interface. Current Fermi-class GPUs use a PCIe gen 2 interface and can achieve 5+ GB/s for transfers from pinned memory (for throughput measurements, 1 GB/s = 10^9 bytes/s).
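As a sanity check, a measured time converts to throughput like this (a minimal sketch, plugging in the 5.02 s figure from the question):

    #include <stdio.h>

    int main(void){
        double seconds = 5.02;                   /* example: the slow copy above */
        double bytes   = 1073741824.0;           /* 1 GB transferred */
        double rate    = bytes / seconds / 1e9;  /* GB/s, with 1 GB/s = 10^9 bytes/s */
        printf("%.2f GB/s\n", rate);             /* about 0.21 GB/s: far below x16 rates */
        return 0;
    }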

Note that PCIe uses a packetized transport, and the packet overhead can be significant at the packet sizes supported by common chipsets, with newer chipsets typically supporting somewhat longer packets. One is unlikely to exceed 70% of the nominal per-direction maximum (4 GB/s for PCIe 1.0 x16, 8 GB/s for PCIe 2.0 x16), even for transfers from / to pinned host memory. For example, at 70% utilization a PCIe 1.0 x16 link tops out at about 2.8 GB/s in practice, consistent with the ~2 GB/s figure above for the C870. Here is a white paper that explains the overhead issue and has a handy graph showing the utilization achievable with various packet sizes:

http://www.plxtech.com/files/pdf/technical/expresslane/Choosing_PCIe_Packet_Payload_Size.pdf

njuffa answered Sep 19 '22