
What is the difference between CUDA cores and Tensor cores?

Tags: cuda, gpu, nvidia

I am completely new to HPC terminology, but I just saw that AWS EC2 released a new instance type powered by the new Nvidia Tesla V100, which has both kinds of "cores": CUDA cores (5,120) and Tensor cores (640). What is the difference between the two?

asked Nov 16 '17 by Simon Ernesto Cardenas Zarate



2 Answers

Currently only the Tesla V100 and Titan V have Tensor cores. Both GPUs have 5120 CUDA cores, where each core can perform up to one single-precision multiply-accumulate operation (e.g. in fp32: x += y * z) per GPU clock (the Tesla V100 PCIe clock frequency is 1.38 GHz).
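As a rough illustration (my own sketch, not part of the original answer), the per-core operation described above is just an ordinary scalar fused multiply-accumulate issued by each thread of a plain CUDA kernel; the kernel and buffer names are placeholders:

```cuda
// Each thread (executed on a CUDA core) performs one fp32
// fused multiply-accumulate per element: x[i] += y[i] * z[i].
__global__ void fma_kernel(float *x, const float *y, const float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] = fmaf(y[i], z[i], x[i]);  // single-precision multiply-accumulate
    }
}
```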

Each Tensor core operates on small 4x4 matrices. A Tensor core can perform one matrix multiply-accumulate operation per GPU clock: it multiplies two 4x4 fp16 matrices and adds the resulting fp32 product matrix (size 4x4) to an accumulator (which is also a 4x4 fp32 matrix).

This is called mixed precision because the input matrices are fp16, while the multiplication result and the accumulator are fp32 matrices.
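For illustration, here is a minimal sketch (my addition, not from the original answer) of how this mixed-precision operation is exposed to CUDA programmers through the warp-level WMMA API introduced with Volta. The API works on 16x16 tiles per warp, which the hardware internally decomposes into the 4x4 steps described above; the kernel name and pointers are placeholders, and it must be compiled for sm_70 or newer and launched with at least one full warp (32 threads):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Warp-level tile operation on Tensor cores: D = A(fp16) * B(fp16) + C(fp32)
__global__ void wmma_example(const half *a, const half *b, const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    // Load one 16x16 tile of each operand (leading dimension 16).
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(c_frag, c, 16, wmma::mem_row_major);

    // The whole warp cooperatively issues the mixed-precision
    // matrix multiply-accumulate on its Tensor cores.
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);

    // Write the fp32 result tile back to memory.
    wmma::store_matrix_sync(d, c_frag, 16, wmma::mem_row_major);
}
```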

Arguably, a more accurate name would be "4x4 matrix cores", but NVIDIA's marketing team decided on "Tensor cores".

answered Sep 19 '22 by Artur


GPUs have always been well suited to machine learning. GPU cores were originally designed for physics and graphics computation, which involves heavy matrix work. General-purpose computing does not require many matrix operations, so CPUs are much slower at them. Physics and graphics are also far easier to parallelise than general computing tasks, which is why GPUs have such high core counts.

Due to the matrix-heavy nature of machine learning (neural networks), GPUs were a great fit. Tensor cores are simply more heavily specialised for the types of computation involved in machine learning software (such as TensorFlow).

Nvidia has written a detailed blog post here, which goes into far more detail on how Tensor cores work and the performance improvements over CUDA cores.

answered Sep 20 '22 by MikeS159