# Self-attention mechanism

Self-attention (also called intra-attention) is a variation of the [[Attention mechanism]]; the attention mechanism itself was introduced for encoder–decoder translation by Bahdanau et al. (2015).[^1]
Instead of taking queries $Q$ from the decoder and keys $K$ and values $V$ from the encoder, all three matrices are computed from the same sequence (either the encoder's or the decoder's).
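Concretely, in the notation of Vaswani et al. (2017), the input sequence is packed into a matrix $X$, projected with learned weight matrices $W^Q$, $W^K$, $W^V$, and fed through the usual scaled dot-product attention, where $d_k$ is the dimensionality of the keys:

$$
Q = XW^Q,\qquad K = XW^K,\qquad V = XW^V,\qquad
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$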
Vaswani et al. (2017)[^2] argue that self-attention compares favourably with convolutional and recurrent layers in per-layer computational complexity, in how much of the computation can be parallelised, and in the path length over which long-range dependencies have to be learned.
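
Below is a minimal single-head sketch of this computation in NumPy; the function name `self_attention` and the randomly initialised projection matrices are illustrative assumptions rather than part of any published implementation:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x:             (seq_len, d_model) input sequence
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    # Queries, keys and values are all derived from the same sequence x.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # (seq_len, seq_len) pairwise scores

    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v                     # each output is a weighted sum of values


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                          # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                # shape (5, 8)
```

Because the $(\text{seq\_len} \times \text{seq\_len})$ weight matrix lets every position attend to every other position in a single step, the maximum path length between any two positions is constant, which is the basis of the long-range-dependency argument above.
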
---
## 📚 References
- Cheng, Jianpeng, et al. [“Long Short-Term Memory-Networks for Machine Reading.”](http://arxiv.org/abs/1601.06733) arXiv:1601.06733 [cs], Sept. 2016.
- Adaloglou, Nikolas. [“Why Multi-Head Self Attention Works: Math, Intuitions and 10+1 Hidden Insights.”](https://theaisummer.com/self-attention/) AI Summer, 25 Mar. 2021.
- Futrzynski, Romain. [Self-Attention: Step-by-Step Video | Peltarion.](https://peltarion.com/blog/data-science/self-attention-video)
- Karim, Raimi. [“Illustrated: Self-Attention.”](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a) Medium, 18 Feb. 2021.
[^1]: Bahdanau, Dzmitry, et al. [“Neural Machine Translation by Jointly Learning to Align and Translate.”](http://arxiv.org/abs/1409.0473) arXiv:1409.0473 [cs, stat], May 2016.
[^2]: Vaswani, Ashish, et al. [“Attention Is All You Need.”](http://arxiv.org/abs/1706.03762) arXiv:1706.03762 [cs], Dec. 2017.