# Self-attention mechanism

Self-attention (also called intra-attention) is a variation of the [[Attention mechanism]]; the attention mechanism itself was introduced for encoder–decoder translation by Bahdanau et al. (2015).[^1]
Instead of taking queries $Q$ from the decoder and keys $K$ and values $V$ from the encoder, all three matrices are computed from the same sequence (either the encoder's or the decoder's).
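Concretely, in the notation of Vaswani et al. (2017), the input sequence is packed into a matrix $X$, projected with learned weight matrices $W^Q$, $W^K$, $W^V$, and fed through the usual scaled dot-product attention, where $d_k$ is the dimensionality of the keys:

$$
Q = XW^Q,\qquad K = XW^K,\qquad V = XW^V,\qquad
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$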
Vaswani et al. (2017)[^2] argue that self-attention compares favourably with convolutional and recurrent layers in per-layer computational complexity, in how much of the computation can be parallelised, and in the path length over which long-range dependencies have to be learned.
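
Below is a minimal single-head sketch of this computation in NumPy; the function name `self_attention` and the randomly initialised projection matrices are illustrative assumptions rather than part of any published implementation:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x:             (seq_len, d_model) input sequence
    w_q, w_k, w_v: (d_model, d_k) learned projection matrices
    """
    # Queries, keys and values are all derived from the same sequence x.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)        # (seq_len, seq_len) pairwise scores

    # Row-wise softmax turns the scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ v                     # each output is a weighted sum of values


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))                          # 5 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                # shape (5, 8)
```

Because the $(\text{seq\_len} \times \text{seq\_len})$ weight matrix lets every position attend to every other position in a single step, the maximum path length between any two positions is constant, which is the basis of the long-range-dependency argument above.
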
---
## 📚 References
- Cheng, Jianpeng, et al. [“Long Short-Term Memory-Networks for Machine Reading.”](http://arxiv.org/abs/1601.06733) arXiv:1601.06733 [cs], Sept. 2016.
- Adaloglou, Nikolas. [“Why Multi-Head Self Attention Works: Math, Intuitions and 10+1 Hidden Insights.”](https://theaisummer.com/self-attention/) AI Summer, 25 Mar. 2021.
- Futrzynski, Romain. [Self-Attention: Step-by-Step Video | Peltarion.](https://peltarion.com/blog/data-science/self-attention-video)
- Karim, Raimi. [“Illustrated: Self-Attention.”](https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a) Medium, 18 Feb. 2021.
[^1]: Bahdanau, Dzmitry, et al. [“Neural Machine Translation by Jointly Learning to Align and Translate.”](http://arxiv.org/abs/1409.0473) arXiv:1409.0473 [cs, stat], May 2016.
[^2]: Vaswani, Ashish, et al. [“Attention Is All You Need.”](http://arxiv.org/abs/1706.03762) arXiv:1706.03762 [cs], Dec. 2017.