Algorithms & Models

[📌 namdarine’s AI Review] Attention is All You Need

namdarine

📚 Transformer Paper Review

The Transformer is one of the most influential architectures in deep learning, forming the backbone of most modern generative AI systems, including ChatGPT. In this review, I will unpack the paper that started it all: “Attention Is All You Need.”

[Figure: Multi-head attention flow, generated by Gen AI]

Paper Summary

The paper introduces a new architecture, the Transformer, that revolutionized parallelization and the modeling of long-range dependencies.

TL;DR

  • The Transformer replaces recurrence with attention, enabling parallel training and modeling long-range dependencies without sequential computation.
  • Its core building blocks are scaled dot-product attention, multi-head attention, position-wise feed-forward networks, and sinusoidal positional encoding.
  • In the encoder, self-attention lets every token attend to all others; in the decoder, masked self-attention preserves autoregressive generation, while encoder-decoder attention connects output tokens to the full input sequence.
  • The architecture uses residual connections + layer normalization across stacked encoder/decoder layers to stabilize and scale training.
  • On WMT 2014 EN-DE and EN-FR, the Big model sets new state-of-the-art BLEU scores, while the Base model already outperforms previous models on EN-DE at a fraction of their training cost.

Limitations of Previous Models and the Problem Setup

RNNs, LSTMs, and GRUs are still strong models, but they come with constraints when you try to parallelize them. One major issue is memory limitations, which cap how effectively you can batch training examples. Techniques like factorization tricks and conditional computation were proposed, but the fundamental bottleneck remains: sequential computation.

To overcome this, the authors introduce the Transformer, a new architecture that avoids recurrence and uses attention mechanisms to capture global dependencies between inputs and outputs.

Recurrent Models

Recurrent models such as RNNs compute sequentially in the following way:

  1. The computation is decomposed along symbol positions in the input and output sequences.
  2. At each step, the model generates the current hidden state h_t as a function of the previous hidden state h_{t-1} and the input at position t (a minimal sketch of this loop follows below).
    -> The hidden state h_t can be thought of as a temporary memory that carries information learned from previous tokens.
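A minimal NumPy sketch of this recurrence (the weight names W_xh and W_hh, the tanh activation, and the shapes are my illustrative assumptions, not the paper's). The explicit loop over t is exactly what blocks parallelization within a sequence:

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b):
    """Run a vanilla RNN over a sequence x of shape (seq_len, d_in).

    Each hidden state h_t depends on h_{t-1}, so the loop over t
    cannot be parallelized within a single training example.
    """
    seq_len, _ = x.shape
    d_hidden = W_hh.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for t in range(seq_len):          # sequential bottleneck
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states)           # (seq_len, d_hidden)
```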

This approach has two major limitations:

  • No parallelization within a training example. This becomes more severe as sequence length grows, and learning long-range dependencies becomes harder.
  • Batching is limited by memory constraints, making it difficult to scale across many examples.

Core Proposal

The paper introduces three key components: Attention, position-wise feed-forward networks, and positional encoding.

Attention

Before diving into how attention is used in the Transformer, let’s quickly clarify how attention works and what types are used here.

In general, attention maps a query and a set of key-value pairs to an output. The query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where each weight is determined by a compatibility function between the query and the corresponding key.

You can think of attention like a meeting where everyone can simultaneously reference everyone else’s statements to form their opinion.

Scaled Dot-Product Attention

This is the specific attention mechanism used in the Transformer. Inputs consist of queries (Q), keys (K), and values (V): queries and keys have dimension d_k, and values have dimension d_v. Multiple queries/keys/values are processed together as matrices.

Computation steps:

| Step | Description |
| --- | --- |
| 1. Dot product | Compute dot products between the query and all keys: QK^T |
| 2. Scaling | Divide by √d_k to prevent large dot products from pushing softmax into regions with tiny gradients |
| 3. Softmax | Apply softmax to obtain attention weights |
| 4. Weighted sum | Multiply the weights by V and sum: Attention(Q, K, V) = softmax(QK^T / √d_k) · V |

A key advantage is that dot-product attention has similar theoretical complexity to additive attention (which uses a feed-forward network with one hidden layer), but in practice it is much faster and more memory-efficient thanks to highly optimized matrix multiplication.
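The four steps above map almost line for line onto code. Below is a minimal NumPy sketch; the shapes and the optional boolean mask argument are illustrative assumptions rather than the paper's notation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # 1. dot product, 2. scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)         # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # 3. softmax over keys
    return weights @ V                                # 4. weighted sum of values
```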

Multi-Head Attention

Instead of running a single attention function with d_model-dimensional Q, K, and V, the Transformer linearly projects Q, K, and V into h different subspaces using learned projections, with dimensions d_k, d_k, and d_v respectively. Attention is then performed in parallel for each head, producing d_v-dimensional outputs. These outputs are concatenated and linearly projected again to form the final output.

The goal is to let the model attend to information from different representation subspaces at different positions simultaneously. With only one head, this capability can be limited due to averaging effects.

  • In this paper, they use h = 8 heads, with d_k = d_v = d_model / h = 64. Since each head is smaller, the overall computation cost stays similar to a single full-dimensional head.

Multi-head attention is like one person simultaneously thinking as a designer, a developer, and an accountant—seeing the same problem through multiple lenses at once.
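A compact sketch of the split-attend-concatenate-project pattern, assuming single-example inputs of shape (seq_len, d_model) and random placeholder projection matrices; the small attention helper restates the scaled dot-product attention so the block is self-contained:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h=8):
    """W_q/W_k/W_v: lists of h per-head projections; W_o: (h * d_v, d_model)."""
    heads = []
    for i in range(h):                        # heads run in parallel in practice
        Q = x_q @ W_q[i]                      # project into a smaller subspace
        K = x_kv @ W_k[i]
        V = x_kv @ W_v[i]
        heads.append(attention(Q, K, V))      # (seq_len, d_v) per head
    return np.concatenate(heads, axis=-1) @ W_o   # concat, then final projection

# Illustrative shapes from the paper: d_model = 512, h = 8, d_k = d_v = 64
d_model, h, d_k = 512, 8, 64
x = np.random.randn(10, d_model)              # 10 tokens
W_q = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model) * 0.02
out = multi_head_attention(x, x, W_q, W_k, W_v, W_o, h)   # self-attention: (10, 512)
```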

How Is Attention Used Inside the Transformer?

  • Encoder-Decoder Attention
    The decoder uses queries from the previous decoder layer and takes keys/values from the encoder output, allowing every decoder position to attend to every position in the input sequence.
  • Encoder Self-Attention
    In the encoder self-attention layer, Q, K, and V all come from the previous encoder layer's output, enabling each position to attend to every other position in the encoder.
  • Decoder Self-Attention
    In decoder self-attention, each position can attend only to earlier positions and itself. This preserves the auto-regressive property by masking future tokens.

With these mechanisms, the Transformer can model dependencies regardless of token distance, without recurrence or convolution. This directly addresses the sequential bottleneck of recurrent models and enables faster training and better translation quality.

Encoder-Decoder Architecture in the Transformer

[Figure 1. The Transformer model architecture (encoder and decoder stacks). Source: Vaswani et al., "Attention Is All You Need" (2017)]

Encoder Stack

The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers:

  1. Multi-head self-attention
  2. A position-wise fully connected feed-forward network

Each sub-layer uses a residual connection followed by layer normalization:
LayerNorm(x + Sublayer(x))

All sub-layers and embeddings produce outputs of dimension d_model = 512 to make residual connections straightforward.

  • Residual connections are like solving a problem with hints. Instead of replacing your work, the teacher preserves what you have done and only adds what you are missing. In the same way, each Transformer layer keeps existing information and learns only a corrective update:
    x + f(x) = existing signal + learned adjustment
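A minimal sketch of the LayerNorm(x + Sublayer(x)) wrapper; the layer-norm epsilon and the toy sublayer in the usage line are standard/illustrative choices not spelled out in this section:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each position's d_model-dimensional vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)): keep the existing signal, add a learned correction."""
    return layer_norm(x + sublayer(x))

# e.g. wrapping an attention block or the position-wise feed-forward network:
x = np.random.randn(10, 512)                      # 10 tokens, d_model = 512
out = residual_sublayer(x, lambda t: 0.1 * t)     # toy sublayer for illustration
```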

Decoder Stack

The decoder is also composed of N = 6 identical layers. Each decoder layer includes three sub-layers: masked self-attention, encoder-decoder multi-head attention, and a position-wise feed-forward network.

Like the encoder, each sub-layer uses residual connections and layer normalization.

Decoder self-attention is modified to prevent information from flowing from future positions. This is implemented via masking, where illegal connections in the softmax inputs are masked (set to negative infinity). This preserves the auto-regressive property by ensuring that, at position i, the decoder can attend only to positions up to and including i.
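A minimal sketch of that mask, assuming precomputed attention logits of shape (seq_len, seq_len); using a large negative constant such as -1e9 in place of a literal negative infinity is a common implementation convenience, not something the paper prescribes:

```python
import numpy as np

def causal_mask_scores(scores):
    """scores: (seq_len, seq_len) attention logits for decoder self-attention.

    Entry (i, j) is masked whenever j > i, so position i can only attend
    to positions 0..i and the auto-regressive property is preserved.
    """
    seq_len = scores.shape[-1]
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower triangle
    return np.where(allowed, scores, -1e9)

print(causal_mask_scores(np.zeros((4, 4))))
# row i keeps columns 0..i; everything to the right becomes -1e9 before softmax
```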

This architecture removes the fundamental sequential bottleneck of recurrent models, enabling significantly more parallelization and resulting in both reduced training time and improved translation quality.

In simple terms: the encoder builds a condensed understanding of the full sentence, and the decoder generates tokens based on that understanding. Both are stacked 6 layers deep, with mechanisms that filter and refine information flow.

Why Self-Attention? (Table 1)

[Figure: Self-attention, generated by Gen AI]
  1. Total computational complexity per layer

    | Layer | Complexity per layer |
    | --- | --- |
    | Self-attention | O(n² · d); faster than recurrent layers when the sequence length n is smaller than the representation dimension d |
    | Recurrent | O(n · d²) |
    | Convolution | O(k · n · d²), where k is the kernel size |

    The paper notes that for very long sequences, self-attention can be restricted to a neighborhood of size r, reducing complexity to O(r · n · d). It also explains that separable convolutions with k = n have complexity comparable to combining a self-attention layer with a position-wise feed-forward layer.

  2. Amount of parallelizable computation

    | Layer | Sequential operations |
    | --- | --- |
    | Self-attention | O(1) sequential operations, enabling high parallelization |
    | Recurrent | O(n) sequential operations |
    | Convolution | O(1) sequential operations, but multiple layers are needed to connect all positions |
  3. Path length for long-range dependencies

    | Layer | Maximum path length |
    | --- | --- |
    | Self-attention | O(1): any two positions are connected in a constant number of steps |
    | Recurrent | O(n) |
    | Convolution | O(n/k) for contiguous kernels; O(log_k(n)) for dilated convolutions |

Summary table:

| Metric | Self-Attention | RNN | CNN |
| --- | --- | --- | --- |
| Compute complexity | O(n² · d) | O(n · d²) | O(k · n · d²) |
| Parallelization | Very high (O(1) sequential ops) | Low (O(n)) | Medium (O(1) per layer) |
| Long-range path length | O(1) | O(n) | O(n/k) to O(log_k(n)) |
  • Side benefit: self-attention can make models more interpretable. Individual heads often learn distinct roles, and many heads exhibit behaviors tied to syntactic or semantic structure.

By leveraging these advantages, the Transformer can train much faster than prior architectures while delivering higher translation quality.
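To make these asymptotics concrete, here is a back-of-the-envelope comparison with illustrative numbers of my own (n = 70 tokens, d = 512, kernel size k = 3), not figures from the paper:

```python
# Rough per-layer operation counts (constants dropped), illustrative only.
n, d, k = 70, 512, 3           # sequence length, model dimension, conv kernel size

self_attention = n**2 * d      # ~2.5M ops: cheaper here because n < d
recurrent      = n * d**2      # ~18.4M ops
convolution    = k * n * d**2  # ~55M ops

print(f"self-attention: {self_attention:,}")
print(f"recurrent:      {recurrent:,}")
print(f"convolution:    {convolution:,}")
```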

Position-Wise Feed-Forward Networks

Each encoder and decoder layer includes a fully connected feed-forward network applied independently to each position (the same way across positions). It consists of two linear transformations with a ReLU in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The same transformation is applied at every position, but each layer uses different parameters. This can also be viewed as two 1×1 convolutions.
The input/output dimension is d_model = 512, and the inner dimension is d_ff = 2048. While the paper does not explicitly justify 2048, it reports better performance with larger d_ff:

  • d_ff = 2048: BLEU 25.8
  • d_ff = 1024: BLEU 25.4
  • d_ff = 4096: BLEU 26.2 (Table 3)
  • You can think of the feed-forward network as a “filtering module” that refines each token’s meaning to better fit its context.
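A minimal sketch of the feed-forward block; the weights are random placeholders, but the dimensions (512 -> 2048 -> 512) and the ReLU follow the description above:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

x = np.random.randn(10, d_model)             # 10 tokens
out = position_wise_ffn(x, W1, b1, W2, b2)   # same shape as the input: (10, 512)
```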

Positional Encoding

Since the Transformer has neither recurrence nor convolution, it needs a way to inject information about token positions. The paper introduces positional encodings, which are added to the input embeddings at the bottom of the encoder and decoder stacks. Positional encodings have dimension d_model so they can be added directly to the embeddings.

Implementation

  • Uses sine and cosine functions:
    • even dimensions
    PE_(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • odd dimensions
    PE_(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Here, pos is the absolute position in the sentence, and i indexes the pair of embedding dimensions.
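A minimal NumPy sketch that follows the two formulas above directly; max_len and the commented usage line are my own illustrative assumptions:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# embeddings = token_embeddings + pe[:seq_len]   (added at the bottom of both stacks)
```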

Even if the formula looks complex, the idea is simple: it assigns a unique, structured “pattern” to each position—essentially giving every token a mathematical coordinate.

Why this design?
The authors hypothesize that for any fixed offset k, PE_(pos+k) can be represented as a linear function of PE_pos, making it easier for the model to learn to attend by relative positions. It may also help the model extrapolate to sequences longer than those seen during training.
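Concretely, writing ω_i = 1 / 10000^(2i/d_model) for the frequency of dimension pair i, the angle-addition identities show that the (sin, cos) pair at position pos + k is a fixed rotation, depending only on k, of the pair at pos:

```latex
\begin{aligned}
\sin\!\big(\omega_i (pos + k)\big) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos\!\big(\omega_i (pos + k)\big) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k) \\[4pt]
\begin{pmatrix} PE_{pos+k,\,2i} \\ PE_{pos+k,\,2i+1} \end{pmatrix}
&=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{pos,\,2i} \\ PE_{pos,\,2i+1} \end{pmatrix}
\end{aligned}
```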

Positional encoding is like a navigator: it gives each token a coordinate so the model understands not just meaning, but ordering. It is like knowing not only the content of a book but also what page you are on—because the same sentence can mean something completely different at the beginning versus the end.
Since the Transformer has no recurrence, it does not inherently know word order. So it assigns position information—like a page number—using sine/cosine patterns. This allows the model to learn order-based meaning, such as “who modifies whom” and how a sentence unfolds, like pinning GPS coordinates onto words.

Training

Data

The paper trains the model on two datasets:

  • WMT 2014 English-German with ~4.5M sentence pairs
  • WMT 2014 English-French, a much larger dataset with ~36M sentence pairs

For EN-DE, they use a shared source-target vocabulary of about 37,000 tokens and apply byte-pair encoding (BPE).
For EN-FR, they use a 32,000 word-piece vocabulary.

Batching and Schedule

Sentence pairs were batched together by approximate sequence length. Each batch contained about 25,000 source tokens and 25,000 target tokens.

Training used 8 NVIDIA P100 GPUs:

  • Base model: 100,000 steps (~12 hours), ~0.4 seconds/step
  • Big model: 300,000 steps (~3.5 days), ~1.0 second/step

Optimization and Regularization

  • Optimizer: Adam with β_1 = 0.9, β_2 = 0.98, ε = 10^(-9)
  • Learning rate schedule: increases linearly for the first warmup_steps = 4000 steps, then decays proportionally to the inverse square root of the step number (see the sketch below)
  • Regularization: residual dropout (Base: P_drop = 0.1) and label smoothing (ε_ls = 0.1)
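The paper's schedule is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)); a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for the first warmup_steps steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps:
print(transformer_lr(4000))   # roughly 0.0007 for d_model = 512
```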

Results

The Transformer achieved strong performance in machine translation and English constituency parsing.

  1. Machine Translation
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) | Notes |
| --- | --- | --- | --- | --- |
| Transformer (Base) | 27.3 | 38.1 | 2.2 · 10^18 (EN-DE) | Outperformed prior models with less training cost |
| Transformer (Big) | 28.4 | 41.8 | 2.3 · 10^19 (EN-DE) | New state-of-the-art on both tasks |
  • WMT 2014 EN-DE:
    • Big Transformer reached 28.4 BLEU, over 2.0 BLEU higher than the best previous model (including ensembles).
    • Base model also beat all prior models and ensembles using only a fraction of their training cost.
  • WMT 2014 EN-FR:
    • Big Transformer achieved 41.8 BLEU, setting a new single-model state-of-the-art, again with only a fraction of the training cost reported in prior work.
  2. Model Variations (Ablations)
    On the newstest2013 dev set, the paper evaluates how different components affect performance:
  • Multi-head attention: a single head performs ~0.9 BLEU worse than the best configuration; too many heads also degrade quality.
  • Key size d_k: reducing the key size lowers quality.
  • Scale: larger models perform better; dropout helps prevent overfitting.
  • Positional encoding: learned positional embeddings perform nearly the same as sinusoidal encodings. Sinusoidal was chosen because it may extrapolate better to longer sequences.
  3. English Constituency Parsing
    The Transformer also generalizes well beyond translation.
  • Using only WSJ training set (~40K sentences), it achieved 91.3 F1.
  • In a semi-supervised setup with ~17M sentences, it achieved 92.7 F1, outperforming all previously reported models except Recurrent Neural Network Grammar.
  • Unlike RNN seq2seq models, the Transformer trained only on WSJ still outperformed the Berkeley Parser.

Conclusion

The Transformer is the first sequence transduction model to completely replace recurrent layers with multi-head self-attention.

Key takeaways:

  • Superior performance: new state-of-the-art results on both WMT 2014 EN-DE and EN-FR. In EN-DE, the best model even surpassed all previously reported ensemble models.
  • Efficiency: trains much faster than recurrent- or convolution-based architectures.
  • Parallelization: by removing recurrence and convolution and relying entirely on attention, the Transformer enables far more parallel computation.

Closing

The Transformer was a breakthrough that overcame the limitations of RNNs—enabling both parallelization and long-range dependency learning. The models that followed—GPT, BERT, T5, Gemini, and more—all trace their starting point back to this paper. As the title says, the essence of the architecture is simple: Attention is all you need.

namdarine will keep creating content that makes foundational technologies like this easier to understand—and easier to use.


📌 namdarine’s AI Review is a series that breaks down papers, algorithms, and architectures so anyone can understand the core ideas behind AI.

Let’s build it like it’s already happened.
→ See you in the next review!