I have a basic knowledge of parallel computing (including some CUDA), feedforward neural networks, and recurrent neural networks (and how they use BPTT).
When using, for example, TensorFlow, you can apply GPU acceleration to the training phase of a network. But recurrent neural networks are sequential in nature: the current timestep depends on the previous one, the next timestep depends on the current one, and so on.
How come GPU acceleration works if this is the case? Is everything that can be computed in parallel computed that way, while the timestep-dependent parts are serialized?
RNNs train using backpropagation through time (BPTT). The recurrent structure is unrolled into a directed acyclic graph of finite length, which looks just like an ordinary feedforward net. It is then trained with stochastic gradient descent, with the constraint that the weights at every timestep must be equal (i.e., the same weight matrices are shared across all timesteps).
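To make the unrolling concrete, here is a minimal NumPy sketch of a vanilla RNN forward pass. The shapes and variable names are illustrative (not taken from any particular library): the point is that the same `W_xh` and `W_hh` matrices are reused at every timestep, which is exactly the equal-weights constraint.

```python
import numpy as np

batch, timesteps, n_in, n_hidden = 32, 10, 8, 16
rng = np.random.default_rng(0)

# Shared parameters: reused at every timestep (the "weights must be equal" constraint).
W_xh = rng.normal(size=(n_in, n_hidden)) * 0.1      # input  -> hidden
W_hh = rng.normal(size=(n_hidden, n_hidden)) * 0.1  # hidden -> hidden
b = np.zeros(n_hidden)

x = rng.normal(size=(batch, timesteps, n_in))  # a batch of input sequences
h = np.zeros((batch, n_hidden))                # initial hidden state

# The loop over timesteps is inherently sequential, but each step is a
# batched matrix multiply, which a GPU can parallelize internally.
for t in range(timesteps):
    h = np.tanh(x[:, t, :] @ W_xh + h @ W_hh + b)

print(h.shape)  # (32, 16)
```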
Once you see that training works this way, i.e. that it is just constrained backpropagation on sequences of a given length, you can see that nothing about the sequential nature prevents the process from being parallelized.
The way you get performance from GPU training of recurrent neural networks is by using a large enough batch size that computing the forward/backward pass for a single cell involves enough work to keep the GPU busy.
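As a rough illustration, here is a hedged Keras sketch (the layer sizes, random data, and batch size are made up for the example): with a batch of 256 sequences, each timestep's hidden-state update is roughly a (256 × features) by (features × hidden) matrix multiply, which gives the GPU plenty of parallel work even though the timesteps themselves still run one after another.

```python
import numpy as np
import tensorflow as tf

timesteps, n_features, n_hidden = 50, 32, 128

model = tf.keras.Sequential([
    tf.keras.Input(shape=(timesteps, n_features)),
    tf.keras.layers.SimpleRNN(n_hidden),  # processes timesteps sequentially
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Dummy data, purely for illustration.
x = np.random.rand(4096, timesteps, n_features).astype("float32")
y = np.random.rand(4096, 1).astype("float32")

# A larger batch_size makes each per-timestep matrix multiply bigger,
# which is what keeps the GPU occupied during training.
model.fit(x, y, batch_size=256, epochs=1)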