# Transformer model
Introduced in 2017 in the now-famous paper by Vaswani et al., "Attention Is All You Need,"[^1] the transformer architecture has swiftly conquered the NLP community. It differs from pre-existing sequence models in that it relies entirely on the [[Attention mechanism]] (and its close relative, the [[Self-attention mechanism]]) and uses no recurrent network units.

## Model architecture

### Encoder
- Stack of $N=6$ identical layers, which have two sub-layers each: a multi-head self-attention mechanism and a fully connected feed-forward layer.
- Residual connection around each sub-layer, followed by layer normalization: the output of each sub-layer is $\mathrm{LayerNorm}(x + \mathrm{SubLayer}(x))$ (see the sketch after this list)
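
A minimal NumPy sketch of this residual-plus-normalization wrapper; the function names (`layer_norm`, `sublayer_connection`) and the toy feed-forward sub-layer are illustrative, not from the paper, and the learnable LayerNorm gain/bias parameters are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean and unit variance
    # (learnable gain/bias omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_connection(x, sublayer):
    # Residual connection around a sub-layer, followed by layer normalization:
    # LayerNorm(x + SubLayer(x)).
    return layer_norm(x + sublayer(x))

# Toy usage: wrap a placeholder feed-forward sub-layer around 10 positions, d_model = 512.
rng = np.random.default_rng(0)
x = rng.normal(size=(10, 512))
W = rng.normal(size=(512, 512))
ffn = lambda h: np.maximum(0.0, h @ W)   # stand-in for the position-wise FFN
out = sublayer_connection(x, ffn)        # same shape as x: (10, 512)
```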
### Decoder
- Stack of $N=6$ identical layers
- In addition to the two sub-layers from the encoder layer, a third sub-layer performs multi-head attention over the output of the encoder stack
- Self-attention modified with masks to prevent positions from attending to subsequent positions
### Attention
#### Scaled dot-product attention

The idea is to compare the queries (packed into a matrix $Q$) against the keys ($K$) via dot products, turn the resulting scores into weights with the softmax function, and then apply those weights to a matrix of values $V$.
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax} \left( \frac{QK^\mathsf{T} }{\sqrt{d_k}} \right) V
$
The dot products are scaled by $1/\sqrt{d_k}$ to counteract the fact that they tend to grow large in magnitude as the key dimension $d_k$ increases, which would push the softmax function into regions where its gradients are extremely small.
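
A minimal NumPy sketch of the formula above; the function names and toy shapes are illustrative, not from the paper. The optional `mask` argument anticipates the decoder masking described further down.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    if mask is not None:
        # Disallowed positions get a large negative score before the softmax.
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ V

# Toy usage: 5 queries attending over 7 key/value positions, d_k = d_v = 64.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 64)), rng.normal(size=(7, 64)), rng.normal(size=(7, 64))
out = scaled_dot_product_attention(Q, K, V)   # shape (5, 64)
```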
#### Multi-head attention

Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions, something a single attention head inhibits through averaging.
$
\begin{align} \mathrm{Multihead}(Q, K, V) & = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h ) W^O \\
\text{where} \quad \mathrm{head}_i & = \mathrm{Attention}(QW^Q_i, KW^K_i, VW^V_i) \\
\end{align}
$
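
A minimal NumPy sketch of the multi-head computation, reusing the `scaled_dot_product_attention` helper from the previous sketch; the random projection matrices stand in for the learned parameters $W^Q_i, W^K_i, W^V_i, W^O$ and are for illustration only.

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: one projection matrix per head; W_o: output projection.
    # Reuses scaled_dot_product_attention from the sketch above.
    heads = [
        scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i)
        for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o

d_model, h = 512, 8
d_k = d_v = d_model // h                  # 64 in the paper's base model
rng = np.random.default_rng(0)
W_q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_v)) for _ in range(h)]
W_o = rng.normal(size=(h * d_v, d_model))

x = rng.normal(size=(10, d_model))        # self-attention: Q = K = V = x
out = multi_head_attention(x, x, x, W_q, W_k, W_v, W_o)   # shape (10, d_model)
```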
#### Applications of attention
- Encoder-decoder attention: queries come from the previous decoder layer, keys and values come from the output of the encoder
- Self-attention in the encoder: queries, keys and values come from the previous encoder layer
- Self-attention in the decoder: queries, keys, and values come from the previous decoder layer; leftward information flow is prevented by masking out (setting to $-\infty$) all values in the input of the softmax that correspond to positions after the attending position (see the mask sketch below)
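
A minimal sketch of such a causal mask, assuming the `mask` convention from the attention sketch above (`True` means the key position may be attended to):

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] is True when query position i may attend to key position j, i.e. j <= i;
    # passing it to the attention sketch above replaces disallowed scores with -1e9
    # before the softmax, preventing leftward information flow.
    return np.tril(np.ones((n, n), dtype=bool))

print(causal_mask(4).astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```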
### Positional encoding
Since there is no recurrence and no convolution, ==some information about token position must be injected== for the model to make use of the order of the sequence. Learned and fixed positional encodings yielded similar results; the authors chose fixed encodings based on sine and cosine functions because they may extrapolate to sequence lengths longer than the ones encountered during training.
$
\begin{align} \mathrm{PE}(\mathrm{pos}, 2i) & = \sin \left( \frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model} }}} \right) \\
\mathrm{PE}(\mathrm{pos}, 2i+1) & = \cos \left( \frac{\mathrm{pos}}{10000^{2i/d_{\mathrm{model} }}} \right) \\ \end{align}
$
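
A minimal NumPy sketch of these sinusoidal encodings; the function name and shapes are illustrative.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]      # (1, d_model / 2): the even indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# The encodings have the same dimension as the embeddings and are simply added to them,
# e.g. x = token_embeddings + positional_encoding(seq_len, d_model).
pe = positional_encoding(max_len=50, d_model=512)  # shape (50, 512)
```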
---
## References
- Alammar, Jay. ["The Illustrated Transformer."](https://jalammar.github.io/illustrated-transformer/)
[^1]: Vaswani, Ashish, et al. "Attention Is All You Need." _arXiv:1706.03762 [cs]_, Dec. 2017. _arXiv.org_, [http://arxiv.org/abs/1706.03762](http://arxiv.org/abs/1706.03762).