# Transformer model

Introduced in 2017 in the now-famous paper by Vaswani et al., "Attention Is All You Need,"[^1] the transformer architecture has swiftly conquered the NLP community. It differs from pre-existing models in that it relies entirely on the [[Attention mechanism]] (and its close relative, the [[Self-attention mechanism]]) and does not use recurrent network units.

![Transformer animation|400](https://3.bp.blogspot.com/-aZ3zvPiCoXM/WaiKQO7KRnI/AAAAAAAAB_8/7a1CYjp40nUg4lKpW7covGZJQAySxlg8QCLcBGAs/s640/transform20fps.gif)

## Model architecture

![Transformer architecture|300](https://i.imgur.com/d1z7kCz.png)

### Encoder

- Stack of $N=6$ identical layers, each with two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
- Residual connections: the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{SubLayer}(x))$ (sketched in code at the end of this note).

### Decoder

- Stack of $N=6$ identical layers.
- In addition to the two sub-layers of each encoder layer, a third sub-layer performs multi-head attention over the output of the encoder.
- Self-attention is modified with masks to prevent positions from attending to subsequent positions.

### Attention

#### Scaled dot-product attention

![Scaled dot-product attention|100](https://i.imgur.com/Oa9u2Lk.png)

The idea is to compare queries (the rows of a matrix $Q$) against keys ($K$), turn the resulting scores into weights with the softmax function, and use those weights to form a weighted sum of the values $V$ (see the sketch at the end of this note).

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{QK^\mathsf{T}}{\sqrt{d_k}} \right) V
$$

The scaling factor $\sqrt{d_k}$ counteracts the fact that dot products tend to grow large in magnitude as the dimensionality increases, pushing the softmax function into regions where its gradients are very small.

#### Multi-head attention

![Multi-head attention|200](https://i.imgur.com/UmKUaIY.png)

Multi-head attention (also sketched at the end of this note) allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this.

$$
\begin{align}
\mathrm{MultiHead}(Q, K, V) & = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h) W^O \\
\text{where} \quad \mathrm{head}_i & = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i)
\end{align}
$$

#### Applications of attention

- Encoder-decoder attention: queries come from the previous decoder layer; keys and values come from the output of the encoder.
- Self-attention in the encoder: queries, keys and values all come from the previous encoder layer.
- Self-attention in the decoder: queries, keys and values all come from the previous decoder layer; leftward information flow is prevented by masking out (setting to $-\infty$) the softmax inputs that correspond to subsequent positions.

### Positional encoding

Since there is no recurrence and no convolution, ==some information about position must be injected for the model== to make use of the order of the sequence. Learned and fixed positional encodings yielded similar results; the authors chose fixed encodings built from sine and cosine functions because they may let the model extrapolate to sequences longer than those encountered during training (see the sketch below).

$$
\begin{align}
\mathrm{PE}(\mathrm{pos}, 2i) & = \sin \left( \frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model}}}} \right) \\
\mathrm{PE}(\mathrm{pos}, 2i+1) & = \cos \left( \frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model}}}} \right)
\end{align}
$$
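## Minimal NumPy sketches

The sketches below are illustrative only: plain NumPy, single unbatched examples, random weights, no training. All function and variable names are my own, not taken from the paper or any reference implementation. First, the residual-plus-normalization wrapper $\mathrm{LayerNorm}(x + \mathrm{SubLayer}(x))$ applied around every sub-layer; the learned gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    """LayerNorm(x + SubLayer(x)): residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

# Example: wrap the position-wise feed-forward sub-layer, ReLU(x W1 + b1) W2 + b2.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x):
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

x = rng.normal(size=(5, d_model))        # 5 positions, d_model features each
print(residual_sublayer(x, feed_forward).shape)  # (5, 8)
```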
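Scaled dot-product attention, with an optional boolean mask to show how decoder self-attention blocks attention to subsequent positions: masked scores are set to $-\infty$ and become zero weights after the softmax. Same assumptions as above (single example, single head).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v).
    mask is True where attention is NOT allowed.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # (n_q, d_v)

# Decoder-style causal mask: position i may only attend to positions <= i.
n, d_k = 4, 8                                  # sequence length, key/value dimension
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
causal_mask = np.triu(np.ones((n, n), dtype=bool), k=1)
print(scaled_dot_product_attention(Q, K, V, mask=causal_mask).shape)  # (4, 8)
```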
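Multi-head attention: each head projects $Q$, $K$, $V$ into a smaller subspace with its own projection matrices, runs scaled dot-product attention there, and the concatenated heads are projected back with $W^O$. The weights are random here; in the paper $h = 8$ and $d_k = d_v = d_{\mathrm{model}}/h = 64$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q WQ_i, K WK_i, V WV_i).

    WQ, WK, WV: (h, d_model, d_k); WO: (h * d_k, d_model).
    """
    heads = []
    for WQ_i, WK_i, WV_i in zip(WQ, WK, WV):
        q, k, v = Q @ WQ_i, K @ WK_i, V @ WV_i       # project into this head's subspace
        scores = q @ k.T / np.sqrt(q.shape[-1])      # scaled dot-product attention
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=-1) @ WO       # concatenate heads, project back

d_model, h = 16, 4
d_k = d_model // h
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d_model))                    # self-attention: Q = K = V = x
WQ, WK, WV = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
WO = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(x, x, x, WQ, WK, WV, WO).shape)  # (5, 16)
```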
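The fixed sinusoidal positional encodings (assuming an even $d_{\mathrm{model}}$): even dimensions get sines, odd dimensions get cosines, with wavelengths forming a geometric progression. In the model they are added to the input embeddings before the first layer.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Fixed sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)    # pos / 10000^(2i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)                 # PE(pos, 2i + 1)
    return pe

print(positional_encoding(n_positions=50, d_model=16).shape)  # (50, 16)
```

Because each dimension is a sinusoid, $\mathrm{PE}(\mathrm{pos}+k)$ is a linear function of $\mathrm{PE}(\mathrm{pos})$ for any fixed offset $k$, which the paper hypothesizes makes it easy for the model to attend by relative position.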
"Attention Is All You Need." _ArXiv:1706.03762 [Cs]_, Dec. 2017. _arXiv.org_, [http://arxiv.org/abs/1706.03762](http://arxiv.org/abs/1706.03762).