Load/Store Units (LD/ST) and Special Function Units (SFUs) for the Kepler architecture

Tags: cuda, nvidia, kepler

In the Kepler architecture whitepaper, NVIDIA states that there are 32 Special Function Units (SFUs) and 32 Load/Store Units (LD/ST) on an SMX.

The SFUs are for "fast approximate transcendental operations". Unfortunately, I don't understand what this is supposed to mean. On the other hand, in "Special CUDA Double Precision trig functions for SFU" it is said that they only work in single precision. Is this still correct on a K20Xm?

The LD/ST units are obviously for storing and loading. Is any memory load/write required to go through one of these? And are they also used as a single warp? In other words, can there be only one warp which is currently writing or reading?

Cheers, Andi

asked Dec 09 '13 14:12

user2267896

2 Answers

The SFU are for "fast approximate transcendental operations"

SFUs compute functions like __cosf(), __expf() etc.

On the other hand, here it is said that they only work in single precision. Is this still correct on a K20Xm?

According to a recent version of the CUDA C Programming Guide (section G.5.1), they still only work in single precision.

It makes some sense, since if you need double precision it's unlikely you would use inaccurate math functions. You can refer to this answer for suggestions on double-precision arithmetic optimizations.

The implementation details of the double-precision operations can be found in /usr/local/cuda-5.5/include/math_functions_dbl_ptx3.h (or wherever your CUDA Toolkit is installed). E.g. for sin and cos it uses Payne-Hanek argument reduction followed by a Taylor expansion (up to order 14).

For double-precision calculations, SFUs seem to be used only in __internal_fast_rcp and __internal_fast_rsqrt, which in turn are used in acos, log, cosh and several other functions (see math_functions_dbl_ptx3.h). So most of the time they sit idle, just as the LD/ST units sit idle when there are no ongoing memory transactions.
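To make the "argument reduction followed by Taylor expansion" idea concrete, here is a rough host-side sketch in Python. It is not the code from math_functions_dbl_ptx3.h: the function names are made up for illustration, and a naive reduction by multiples of pi/2 stands in for Payne-Hanek, so it is only reasonable for moderate arguments.

```python
import math

def reduce_to_quadrant(x):
    """Naive argument reduction: x = k*(pi/2) + r with |r| roughly <= pi/4.

    Assumption: plain floating-point arithmetic stands in for Payne-Hanek,
    so accuracy degrades for very large |x|.
    """
    k = round(x / (math.pi / 2))
    r = x - k * (math.pi / 2)
    return k % 4, r

def taylor_sin(r, max_order=14):
    """Sum the odd Taylor terms r - r^3/3! + r^5/5! - ... up to max_order."""
    term, total, n = r, 0.0, 1
    while n <= max_order:
        total += term
        term *= -r * r / ((n + 1) * (n + 2))
        n += 2
    return total

def taylor_cos(r, max_order=14):
    """Sum the even Taylor terms 1 - r^2/2! + r^4/4! - ... up to max_order."""
    term, total, n = 1.0, 0.0, 0
    while n <= max_order:
        total += term
        term *= -r * r / ((n + 1) * (n + 2))
        n += 2
    return total

def sketch_sin(x):
    """sin(k*pi/2 + r) picks sin/cos of r with a sign, depending on k mod 4."""
    quadrant, r = reduce_to_quadrant(x)
    if quadrant == 0:
        return taylor_sin(r)
    if quadrant == 1:
        return taylor_cos(r)
    if quadrant == 2:
        return -taylor_sin(r)
    return -taylor_cos(r)
```

Because the reduced argument is small, a short polynomial is already close to math.sin for everyday inputs. The real library has to work much harder on the reduction step to stay accurate for huge arguments.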

Is any memory load/write required to go through one of these?

Yes, each access to global memory.

And are they also used as a single warp? In other words can there be only one warp which is currently writing or reading?

The number of units constrains only the number of instructions issued each cycle. I.e., in each clock cycle 32 read instructions could be issued, and 32 results could be returned.

One instruction can read/write up to 128 bytes, so if each thread in a warp reads 4 bytes and the accesses are coalesced, then the whole warp requires only a single load/store instruction. If the accesses are uncoalesced, then more instructions must be issued.

Moreover, the units are pipelined, meaning that multiple load/store requests can be in flight concurrently in a single unit.
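The 128-byte arithmetic can be checked with a small Python model. This is only a sketch: transactions_for_warp is a hypothetical helper, and it assumes a transaction always covers one aligned 128-byte segment, whereas the real granularity depends on the cache configuration.

```python
SEGMENT_BYTES = 128  # bytes covered by one transaction, per the figure above
WARP_SIZE = 32

def transactions_for_warp(addresses, access_bytes=4):
    """Count the distinct aligned 128-byte segments touched by one warp."""
    segments = set()
    for addr in addresses:
        first = addr // SEGMENT_BYTES
        last = (addr + access_bytes - 1) // SEGMENT_BYTES
        segments.update(range(first, last + 1))
    return len(segments)

# Each lane reads 4 consecutive bytes: 32 * 4 = 128 bytes, one segment.
coalesced = [4 * lane for lane in range(WARP_SIZE)]
# Each lane reads from its own 128-byte segment: fully uncoalesced.
strided = [128 * lane for lane in range(WARP_SIZE)]
# Consecutive but shifted by 64 bytes: the warp straddles two segments.
misaligned = [4 * lane + 64 for lane in range(WARP_SIZE)]

print(transactions_for_warp(coalesced))   # 1
print(transactions_for_warp(strided))     # 32
print(transactions_for_warp(misaligned))  # 2
```

So under this model the coalesced warp needs one transaction, the strided one needs 32, and a misaligned but contiguous access pattern needs two.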

answered Sep 20 '22 14:09

aland

Don't accept this as an answer -- we're hoping that someone will come along and answer your question about double precision transcendental operations. I just wanted to address the second part of your question, about the LD/ST units.

The LD/ST units are obviously for storing and loading.

Yes.

Is any memory load/write required to go through one of these?

Yes.

And are they also used as a single warp?

Yes, all active threads in a warp always issue the same type of instruction in the same clock cycle. If that instruction is a load or store, it gets issued to the LD/ST units. If a thread is inactive (due to looping or conditional execution), the corresponding LD/ST unit stays idle.

In other words can there be only one warp which is currently writing or reading?

No, the LD/ST units can accept one load or store operation per clock, even though memory latency can be several hundred cycles. So, when one warp issues a load instruction, the LD/ST units will start working on retrieving that data. Instructions in the warp that depend on the data become ineligible to be issued until the data arrives. In the next clock cycle, the warp may still execute other independent instructions (instruction-level parallelism), including other, independent load or store instructions.

Another warp that is eligible to be scheduled may also, in the next clock cycle, issue another load instruction and itself go into a waiting state (thread-level parallelism). At that point, the LD/ST units are keeping track of two pending results. Due to caching and coalescing, it is possible that the data for the second warp arrives first. When data for a warp arrives, it gets assigned to the registers designated in the instruction, and that particular data dependency is then resolved.
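A back-of-the-envelope model shows why accepting one operation per clock matters even when each load takes hundreds of cycles. This is only a sketch: the function names are hypothetical, the 400-cycle latency is an illustrative stand-in for "several hundred cycles", and it ignores caching, coalescing and bandwidth limits.

```python
def cycles_serialized(num_loads, latency):
    """Each load waits for the previous one to complete before issuing."""
    return num_loads * latency

def cycles_pipelined(num_loads, latency, issue_per_cycle=1):
    """Loads issue back to back; the last result arrives `latency`
    cycles after the last load has been issued."""
    if num_loads == 0:
        return 0
    issue_cycles = -(-num_loads // issue_per_cycle)  # ceiling division
    return (issue_cycles - 1) + latency

def max_in_flight(num_loads, latency, issue_per_cycle=1):
    """Upper bound on loads pending at once in this simple model."""
    return min(num_loads, latency * issue_per_cycle)

LATENCY = 400  # assumed illustrative latency in cycles

print(cycles_serialized(8, LATENCY))  # 3200
print(cycles_pipelined(8, LATENCY))   # 407
print(max_in_flight(8, LATENCY))      # 8
```

With eight independent loads from eligible warps, the pipelined case finishes in about one latency period instead of eight, which is the latency hiding described above.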

answered Sep 16 '22 14:09

Roger Dahl
