The server has two NVIDIA K20m cards installed, both with ECC enabled. I have observed that the Volatile GPU-Utilization reported by the nvidia-smi -a command is high, even though no compute task is running on the cards. The K20m is used only for computing. I searched Google and checked the following links: https://devtalk.nvidia.com/default/topic/539632/k20-with-high-utilization-but-no-compute-processes-/ and https://devtalk.nvidia.com/default/topic/464744/how-to-disable-enable-ecc-on-c2050-/
It seems that ECC is generally regarded as a bad feature, so it is always disabled. So what is the true meaning of ECC? I'm just a common user of that server, so I don't have the right to use the command nvidia-smi -e 0 to disable ECC. Is it possible for a common user to disable ECC?
What are the effects of turning off ECC? When should we turn it on, and when off?
The GPU utilization may become non-zero when running nvidia-smi even when no other compute tasks are running. This has no connection to ECC.
So what is the true meaning of ECC?
ECC is Error Correcting Code. It is not unique to GPUs. On GPUs, it is a feature that uses extra memory bits to store error information, so that if an error (of particular severity) occurs in the memory subsystem it can either be detected and reported, or detected and corrected.
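To make "detected and corrected" concrete, here is a minimal sketch of a textbook Hamming(7,4) code in Python. It is only a toy illustration of the principle of spending extra bits to locate and repair a single flipped bit; it is not the code that GPU memory hardware uses, and the function names are made up for this example.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position).

    error_position is 0 when no error was seen, otherwise the 1-based
    position of the single bit that was flipped back.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1  # correct the single-bit error
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
corrupted = list(word)
corrupted[4] ^= 1  # simulate a bit flip at position 5
print(hamming74_decode(corrupted))  # -> ([1, 0, 1, 1], 5)
```

Three extra bits protect four data bits here, which is the same trade the answer describes: storage is given up in exchange for the ability to notice and repair an error.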
Is it possible for a common user to disable ECC?
Disabling ECC requires root privileges on Linux.
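Reading the ECC mode, unlike changing it, should not need root. The following is a rough Python sketch that asks nvidia-smi for the current and pending ECC mode of each GPU using its --query-gpu option. The helper names are hypothetical, and it assumes nvidia-smi is on the PATH and that the driver supports the ecc.mode.current and ecc.mode.pending query fields.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; the helper names below are illustrative.
QUERY_FIELDS = ["index", "name", "ecc.mode.current", "ecc.mode.pending"]


def parse_ecc_csv(text):
    """Parse 'csv,noheader' output from nvidia-smi into a list of dicts."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        values = [field.strip() for field in record]
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_ecc_modes():
    """Run nvidia-smi and return one dict per GPU (assumes nvidia-smi is on PATH)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ecc_csv(result.stdout)
```

Calling query_ecc_modes() on the server returns one dictionary per GPU. If the pending mode differs from the current mode, an administrator has requested a change that has not taken effect yet.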
What are the effects of turning off ECC?
Both the memory bandwidth and the memory size available to your GPU application will generally increase. If you turn off ECC and a memory subsystem error occurs, you will receive no explicit notification. The error could have any range of effects, from none at all to a complete application crash, depending on the context in which it occurred.
When should we turn it on, and when off?
Turn it on when you want protection from memory corruption errors. Turn it off if you want maximum performance (e.g. for benchmarking) or you believe that your application can tolerate memory errors (e.g. you check the validity of results and you don't mind re-running an application that failed for some reason.)
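The "check the validity of results and re-run" approach could look roughly like the Python sketch below. The names run_with_validation, compute and is_valid are hypothetical placeholders; what counts as a valid result depends entirely on your application.

```python
def run_with_validation(compute, is_valid, max_attempts=3):
    """Call compute() until is_valid(result) is true or attempts run out.

    Returns (result, attempts_used). Raises RuntimeError if no attempt
    produced a result that passed the validity check.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = compute()  # e.g. launch the GPU job and collect its output
        if is_valid(result):
            return result, attempt
    raise RuntimeError(f"no valid result after {max_attempts} attempts")
```

This only helps against errors that the validity check can actually recognize; a silent corruption that still produces plausible-looking output would pass through, which is the risk ECC is meant to cover.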
Note that some newer GPUs with HBM (HBM2) memory may have somewhat different characteristics. Because of the design of HBM2 memory, enabling ECC generally results in little or no performance loss (bandwidth) and no reduction in memory size. For GPUs with HBM2 memory, the general recommendation is to leave the ECC on all the time.