I am following this blog on transformers
http://jalammar.github.io/illustrated-transformer/
The only thing I don't understand is why there needs to be a stack of encoders and decoders. I understand that the multi-headed attention layers capture different representation spaces of the problem, but why is a vertical stack of encoders and decoders necessary? Wouldn't a single encoder/decoder layer work?
Stacking layers is what makes any deep learning architecture powerful. A single encoder/decoder with attention would not be able to capture the complexity needed to model an entire language or achieve high accuracy on tasks as complex as language translation. Using a stack of encoders/decoders allows the network to extract hierarchical features and model complex problems: each layer refines the representations produced by the layer below it, so higher layers can build more abstract features on top of lower-level ones.
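To make the idea of a vertical stack concrete, here is a minimal PyTorch sketch (the dimensions and the choice of six layers, matching the N=6 of the original "Attention Is All You Need" setup, are just illustrative) that stacks identical encoder layers on top of each other:

```python
import torch
import torch.nn as nn

# One encoder layer: multi-head self-attention + feed-forward, as described in the blog post.
encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)

# Stack 6 identical layers vertically; the output of one layer is the input of the next,
# so each layer refines the token representations produced by the layer below it.
stacked_encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# A batch of 10 sequences, each 32 tokens long, embedded into 512 dimensions
# (default layout is (seq_len, batch, d_model)).
src = torch.rand(32, 10, 512)
out = stacked_encoder(src)
print(out.shape)  # torch.Size([32, 10, 512]) -- same shape, progressively richer features
```

With `num_layers=1` the model still runs, but it can only apply attention and a feed-forward transform once; adding layers is what lets it compose those operations into deeper, hierarchical features.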