 

How to deal with the ECC support feature in NVIDIA graphics cards

Tags:

cuda

nvidia

The server has two NVIDIA K20m cards installed, with ECC enabled. Using the nvidia-smi -a command, I have observed that the Volatile GPU-Utilization is high even though no compute task is running on the cards. The K20m cards are used only for computing. I searched Google and checked the following links: https://devtalk.nvidia.com/default/topic/539632/k20-with-high-utilization-but-no-compute-processes-/ and https://devtalk.nvidia.com/default/topic/464744/how-to-disable-enable-ecc-on-c2050-/

From those threads, it seems ECC is treated as a bad feature and is usually disabled. So what's the true meaning of ECC? I'm just a common user of that server, so I don't have the privileges to run nvidia-smi -e 0 to disable ECC. Is it possible for a common user to disable ECC?

What are the effects when we turn off ECC? When should we turn it on, and when off?

mining asked Sep 07 '14 10:09



1 Answer

The GPU utilization may become non-zero when running nvidia-smi even when no other compute tasks are running. This has no connection to ECC.

So what's the true meaning of ECC?

ECC is Error Correcting Code. It is not unique to GPUs. On GPUs, it is a feature that uses extra memory bits to store error information, so that if an error (of particular severity) occurs in the memory subsystem it can either be detected and reported, or detected and corrected.
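To make the "extra bits" idea concrete, here is a toy Hamming(7,4) code in Python. It is only an illustration of how a few parity bits let a single flipped bit be located and corrected; it is not the code GPU memory hardware actually uses, and the function names are made up for this sketch.

```python
def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c):
    """Return (data_bits, error_position); error_position is 0 if no error was seen."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The three syndrome bits, read as a binary number, give the 1-based
    # position of a single flipped bit (0 means all parity checks passed).
    pos = s1 + 2 * s2 + 4 * s3
    if pos:
        c[pos - 1] ^= 1  # correct the flipped bit
    return [c[2], c[4], c[5], c[6]], pos


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1  # simulate a single bit flip in memory
data, pos = hamming74_decode(word)
print(data, pos)  # [1, 0, 1, 1] 5
```

With ECC off, the equivalent of `word[4] ^= 1` would simply go unnoticed and the application would read the corrupted value.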

Is it possible for a common user to disable ECC?

Disabling ECC requires root privilege on Linux.
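Reading the ECC state, unlike changing it, should not need root. A minimal sketch, assuming nvidia-smi is on the PATH and supports the `ecc.mode.current` and `ecc.mode.pending` query fields; the helper names `parse_ecc_csv` and `read_ecc_modes` are hypothetical, and the sample line in the comment is illustrative rather than captured output.

```python
import shutil
import subprocess

QUERY = "index,name,ecc.mode.current,ecc.mode.pending"


def parse_ecc_csv(text):
    """Parse the output of nvidia-smi --query-gpu=<QUERY> --format=csv,noheader.

    Each line is expected to look roughly like: 0, Tesla K20m, Enabled, Enabled
    """
    gpus = []
    for line in text.strip().splitlines():
        index, name, current, pending = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "ecc_current": current,   # mode the GPU is running with now
            "ecc_pending": pending,   # mode that takes effect after a reboot/reset
        })
    return gpus


def read_ecc_modes():
    """Return the ECC modes of all GPUs, or [] if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_ecc_csv(result.stdout)
```

Calling `read_ecc_modes()` as an ordinary user shows whether ECC is on. An administrator who changes it with nvidia-smi -e 0 will typically see the pending mode differ from the current one until the GPU is reset or the machine rebooted.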

What are the effects when we turn off ECC?

Turning ECC off may increase both the memory bandwidth and the memory size available to your GPU application. If you turn off ECC and a memory subsystem error occurs, you'll receive no explicit notification. The error could have effects ranging from none at all to a complete application crash, depending on the context in which it occurred.
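As a rough worked example of the memory-size effect: on GPUs where the ECC bits are stored in ordinary device memory, part of that memory is reserved when ECC is on. The sketch below uses a hypothetical 5120 MiB card and an assumed reserved fraction of 6.25%; both numbers are illustrative, so compare the totals nvidia-smi reports on your own card.

```python
def usable_memory_mib(total_mib, ecc_enabled, ecc_reserved_fraction):
    """Estimate the memory visible to applications.

    ecc_reserved_fraction is the share of device memory set aside for ECC
    bits when ECC is on. It is an input here, not a known constant.
    """
    if not 0.0 <= ecc_reserved_fraction < 1.0:
        raise ValueError("ecc_reserved_fraction must be in [0, 1)")
    if not ecc_enabled:
        return float(total_mib)
    return total_mib * (1.0 - ecc_reserved_fraction)


# Assumed figures for illustration only.
total = 5120
print(usable_memory_mib(total, ecc_enabled=False, ecc_reserved_fraction=0.0625))  # 5120.0
print(usable_memory_mib(total, ecc_enabled=True, ecc_reserved_fraction=0.0625))   # 4800.0
```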

When should we turn it on, and when off?

Turn it on when you want protection from memory corruption errors. Turn it off if you want maximum performance (e.g. for benchmarking) or you believe that your application can tolerate memory errors (e.g. you check the validity of results and you don't mind re-running an application that failed for some reason).
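The "check the results and re-run" approach can be sketched as a small wrapper. This is a minimal sketch, not a recommended framework: `run_with_validation`, `compute` and `is_valid` are hypothetical names, and what counts as a valid result depends entirely on your application.

```python
def run_with_validation(compute, is_valid, max_attempts=3):
    """Run compute() until is_valid(result) passes or attempts run out.

    Returns (result, attempts_used). A crash inside compute() is treated
    the same way as an invalid result: the job is simply run again.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            result = compute()
        except Exception as exc:  # deliberately broad: any failure triggers a re-run
            last_error = exc
            continue
        if is_valid(result):
            return result, attempt
    raise RuntimeError(f"no valid result after {max_attempts} attempts") from last_error


# Stand-in for a GPU job whose first run returns a corrupted value.
outputs = iter([-1.0, 42.0])
result, attempts = run_with_validation(lambda: next(outputs), lambda r: r >= 0.0)
print(result, attempts)  # 42.0 2
```

This only helps when corruption is detectable from the output. An error that silently yields a plausible but wrong result passes the check, which is the case ECC is meant to cover.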

Note that some newer GPUs with HBM (HBM2) memory may have somewhat different characteristics. Because of the design of HBM2 memory, enabling ECC generally results in little or no performance loss (bandwidth) and no reduction in memory size. For GPUs with HBM2 memory, the general recommendation is to leave the ECC on all the time.

Robert Crovella answered Oct 31 '22 11:10
