Algorithms & Models

[📌 namdarine’s AI Review] Attention is All You Need

namdarine

📚 Transformer Paper Review

The Transformer is one of the most influential architectures in deep learning, forming the backbone of most modern generative AI systems, including ChatGPT. In this review, I will unpack the paper that started it all: “Attention Is All You Need.”

[Figure: Multi-head attention flow, generated by Gen AI]

Paper Summary

The paper introduces a new architecture, the Transformer, that revolutionized parallelization and the modeling of long-range dependencies.

TL;DR

  • The Transformer replaces recurrence with attention, enabling parallel training and modeling long-range dependencies without sequential computation.
  • Its core building blocks are scaled dot-product attention, multi-head attention, position-wise feed-forward networks, and sinusoidal positional encoding.
  • In the encoder, self-attention lets every token attend to all others; in the decoder, masked self-attention preserves autoregressive generation, while encoder-decoder attention connects output tokens to the full input sequence.
  • The architecture uses residual connections + layer normalization across stacked encoder/decoder layers to stabilize and scale training.
  • On WMT 2014 EN-DE and EN-FR, the Big model sets new state-of-the-art BLEU scores, while the Base model already outperforms previous models on EN-DE at a fraction of their training cost.

Limitations of Previous Models and the Problem Setup

RNNs, LSTMs, and GRUs are still strong models, but they come with constraints when you try to parallelize them. One major issue is memory limitations, which cap how effectively you can batch training examples. Techniques like factorization tricks and conditional computation were proposed, but the fundamental bottleneck remains: sequential computation.

To overcome this, the authors introduce the Transformer, a new architecture that avoids recurrence and uses attention mechanisms to capture global dependencies between inputs and outputs.

Recurrent Models

Recurrent models such as RNNs compute sequentially in the following way:

  1. The computation is decomposed along symbol positions in the input and output sequences.
  2. At each step, the model generates the current hidden state h_t as a function of the previous hidden state h_{t-1} and the input at position t (a minimal sketch of this loop follows below).
    -> The hidden state h_t can be thought of as a temporary memory that carries information learned from previous tokens.
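A minimal NumPy sketch of this recurrence (the weight names W_xh and W_hh, the tanh activation, and the shapes are my illustrative assumptions, not the paper's). The explicit loop over t is exactly what blocks parallelization within a sequence:

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b):
    """Run a vanilla RNN over a sequence x of shape (seq_len, d_in).

    Each hidden state h_t depends on h_{t-1}, so the loop over t
    cannot be parallelized within a single training example.
    """
    seq_len, _ = x.shape
    d_hidden = W_hh.shape[0]
    h = np.zeros(d_hidden)
    states = []
    for t in range(seq_len):          # sequential bottleneck
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states)           # (seq_len, d_hidden)
```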

This approach has two major limitations:

  • No parallelization within a training example. This becomes more severe as sequence length grows, and learning long-range dependencies becomes harder.
  • Batching is limited by memory constraints, making it difficult to scale across many examples.

Core Proposal

The paper introduces three key components: Attention, position-wise feed-forward networks, and positional encoding.

Attention

Before diving into how attention is used in the Transformer, let’s quickly clarify how attention works and what types are used here.

In general, attention maps a query and a set of key-value pairs to an output. The query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where each weight is determined by a compatibility function between the query and the corresponding key.

You can think of attention like a meeting where everyone can simultaneously reference everyone else’s statements to form their opinion.

Scaled Dot-Product Attention

This is the specific attention mechanism used in the Transformer. Inputs consist of queries (Q), keys (K), and values (V): queries and keys have dimension d_k, and values have dimension d_v. Multiple queries/keys/values are processed together as matrices.

Computation steps:

| Step | Description |
| --- | --- |
| 1. Dot product | Compute dot products between the query and all keys: QK^T |
| 2. Scaling | Divide by √d_k to prevent large dot products from pushing softmax into regions with tiny gradients |
| 3. Softmax | Apply softmax to obtain attention weights |
| 4. Weighted sum | Multiply the weights by V and sum: Attention(Q, K, V) = softmax(QK^T / √d_k) · V |

A key advantage is that dot-product attention has similar theoretical complexity to additive attention (which uses a feed-forward network with one hidden layer), but in practice it is much faster and more memory-efficient thanks to highly optimized matrix multiplication.
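The four steps above map almost line for line onto code. Below is a minimal NumPy sketch; the shapes and the optional boolean mask argument are illustrative assumptions rather than the paper's notation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output of shape (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # 1. dot product, 2. scaling
    if mask is not None:
        scores = np.where(mask, scores, -1e9)         # block disallowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # 3. softmax over keys
    return weights @ V                                # 4. weighted sum of values
```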

Multi-Head Attention

Instead of running a single attention function with d_model-dimensional Q, K, and V, the Transformer linearly projects Q, K, and V into h different subspaces using learned projections, with dimensions d_k, d_k, and d_v respectively. Attention is then performed in parallel for each head, producing d_v-dimensional outputs. These outputs are concatenated and linearly projected again to form the final output.

The goal is to let the model attend to information from different representation subspaces at different positions simultaneously. With only one head, this capability can be limited due to averaging effects.

  • In this paper, they use h = 8 heads, with d_k = d_v = d_model / h = 64. Since each head is smaller, the overall computation cost stays similar to a single full-dimensional head.

Multi-head attention is like one person simultaneously thinking as a designer, a developer, and an accountant—seeing the same problem through multiple lenses at once.
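A compact sketch of the split-attend-concatenate-project pattern, assuming single-example inputs of shape (seq_len, d_model) and random placeholder projection matrices; the small attention helper restates the scaled dot-product attention so the block is self-contained:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head_attention(x_q, x_kv, W_q, W_k, W_v, W_o, h=8):
    """W_q/W_k/W_v: lists of h per-head projections; W_o: (h * d_v, d_model)."""
    heads = []
    for i in range(h):                        # heads run in parallel in practice
        Q = x_q @ W_q[i]                      # project into a smaller subspace
        K = x_kv @ W_k[i]
        V = x_kv @ W_v[i]
        heads.append(attention(Q, K, V))      # (seq_len, d_v) per head
    return np.concatenate(heads, axis=-1) @ W_o   # concat, then final projection

# Illustrative shapes from the paper: d_model = 512, h = 8, d_k = d_v = 64
d_model, h, d_k = 512, 8, 64
x = np.random.randn(10, d_model)              # 10 tokens
W_q = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_k = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_v = [np.random.randn(d_model, d_k) * 0.02 for _ in range(h)]
W_o = np.random.randn(h * d_k, d_model) * 0.02
out = multi_head_attention(x, x, W_q, W_k, W_v, W_o, h)   # self-attention: (10, 512)
```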

How Is Attention Used Inside the Transformer?

  • Encoder-Decoder Attention
    The decoder uses queries from the previous decoder layer and takes keys/values from the encoder output, allowing every decoder position to attend to every position in the input sequence.
  • Encoder Self-Attention
    In the encoder self-attention layer, Q, K, and V all come from the previous encoder layer's output, enabling each position to attend to every other position in the encoder.
  • Decoder Self-Attention
    In decoder self-attention, each position can attend only to earlier positions and itself. This preserves the auto-regressive property by masking future tokens.

With these mechanisms, the Transformer can model dependencies regardless of token distance, without recurrence or convolution. This directly addresses the sequential bottleneck of recurrent models and enables faster training and better translation quality.

Encoder-Decoder Architecture in the Transformer

[Figure 1. The Transformer model architecture (encoder and decoder stacks). Source: Vaswani et al., "Attention Is All You Need" (2017)]

Encoder Stack

The encoder is a stack of N = 6 identical layers. Each layer has two sub-layers:

  1. Multi-head self-attention
  2. A position-wise fully connected feed-forward network

Each sub-layer uses a residual connection followed by layer normalization:
LayerNorm(x + Sublayer(x))

All sub-layers and embeddings produce outputs of dimension d_model = 512 to make residual connections straightforward.

  • Residual connections are like solving a problem with hints. Instead of replacing your work, the teacher preserves what you have done and only adds what you are missing. In the same way, each Transformer layer keeps existing information and learns only a corrective update:
    x + f(x) = existing signal + learned adjustment
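A minimal sketch of the LayerNorm(x + Sublayer(x)) wrapper; the layer-norm epsilon and the toy sublayer in the usage line are standard/illustrative choices not spelled out in this section:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    """Normalize each position's d_model-dimensional vector to zero mean, unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def residual_sublayer(x, sublayer):
    """LayerNorm(x + Sublayer(x)): keep the existing signal, add a learned correction."""
    return layer_norm(x + sublayer(x))

# e.g. wrapping an attention block or the position-wise feed-forward network:
x = np.random.randn(10, 512)                      # 10 tokens, d_model = 512
out = residual_sublayer(x, lambda t: 0.1 * t)     # toy sublayer for illustration
```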

Decoder Stack

The decoder is also composed of N = 6 identical layers. Each decoder layer includes three sub-layers: masked self-attention, encoder-decoder multi-head attention, and a position-wise feed-forward network.

Like the encoder, each sub-layer uses residual connections and layer normalization.

Decoder self-attention is modified to prevent information from flowing from future positions. This is implemented via masking, where illegal connections in the softmax inputs are masked (set to negative infinity). This preserves the auto-regressive property by ensuring that, at position i, the decoder can attend only to positions up to and including i.
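A minimal sketch of that mask, assuming precomputed attention logits of shape (seq_len, seq_len); using a large negative constant such as -1e9 in place of a literal negative infinity is a common implementation convenience, not something the paper prescribes:

```python
import numpy as np

def causal_mask_scores(scores):
    """scores: (seq_len, seq_len) attention logits for decoder self-attention.

    Entry (i, j) is masked whenever j > i, so position i can only attend
    to positions 0..i and the auto-regressive property is preserved.
    """
    seq_len = scores.shape[-1]
    allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # lower triangle
    return np.where(allowed, scores, -1e9)

print(causal_mask_scores(np.zeros((4, 4))))
# row i keeps columns 0..i; everything to the right becomes -1e9 before softmax
```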

This architecture removes the fundamental sequential bottleneck of recurrent models, enabling significantly more parallelization and resulting in both reduced training time and improved translation quality.

In simple terms: the encoder builds a condensed understanding of the full sentence, and the decoder generates tokens based on that understanding. Both are stacked 6 layers deep, with mechanisms that filter and refine information flow.

Why Self-Attention? (Table 1)

[Figure: Self-attention, generated by Gen AI]
  1. Total computational complexity per layer

    | Layer | Complexity per layer |
    | --- | --- |
    | Self-attention | O(n² · d); faster than recurrent layers when the sequence length n is smaller than the representation dimension d |
    | Recurrent | O(n · d²) |
    | Convolution | O(k · n · d²), where k is the kernel size |

    The paper notes that for very long sequences, self-attention can be restricted to a neighborhood of size r, reducing complexity to O(r · n · d). It also explains that separable convolutions with k = n have complexity comparable to combining a self-attention layer with a position-wise feed-forward layer.

  2. Amount of parallelizable computation

    | Layer | Sequential operations |
    | --- | --- |
    | Self-attention | O(1) sequential operations, enabling high parallelization |
    | Recurrent | O(n) sequential operations |
    | Convolution | O(1) sequential operations, but multiple layers are needed to connect all positions |
  3. Path length for long-range dependencies

    | Layer | Maximum path length |
    | --- | --- |
    | Self-attention | O(1): any two positions are connected in a constant number of steps |
    | Recurrent | O(n) |
    | Convolution | O(n/k) for contiguous kernels; O(log_k(n)) for dilated convolutions |

Summary table:

| Metric | Self-Attention | RNN | CNN |
| --- | --- | --- | --- |
| Compute complexity | O(n² · d) | O(n · d²) | O(k · n · d²) |
| Parallelization | Very high (O(1) sequential ops) | Low (O(n)) | Medium (O(1) per layer) |
| Long-range path length | O(1) | O(n) | O(n/k) to O(log_k(n)) |
  • Side benefit: self-attention can make models more interpretable. Individual heads often learn distinct roles, and many heads exhibit behaviors tied to syntactic or semantic structure.

By leveraging these advantages, the Transformer can train much faster than prior architectures while delivering higher translation quality.
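To make these asymptotics concrete, here is a back-of-the-envelope comparison with illustrative numbers of my own (n = 70 tokens, d = 512, kernel size k = 3), not figures from the paper:

```python
# Rough per-layer operation counts (constants dropped), illustrative only.
n, d, k = 70, 512, 3           # sequence length, model dimension, conv kernel size

self_attention = n**2 * d      # ~2.5M ops: cheaper here because n < d
recurrent      = n * d**2      # ~18.4M ops
convolution    = k * n * d**2  # ~55M ops

print(f"self-attention: {self_attention:,}")
print(f"recurrent:      {recurrent:,}")
print(f"convolution:    {convolution:,}")
```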

Position-Wise Feed-Forward Networks

Each encoder and decoder layer includes a fully connected feed-forward network applied independently to each position (the same way across positions). It consists of two linear transformations with a ReLU in between:
FFN(x) = max(0, xW_1 + b_1)W_2 + b_2

The same transformation is applied at every position, but each layer uses different parameters. This can also be viewed as two 1×1 convolutions.
The input/output dimension is d_model = 512, and the inner dimension is d_ff = 2048. While the paper does not explicitly justify 2048, it reports better performance with larger d_ff:

  • d_ff = 2048: BLEU 25.8
  • d_ff = 1024: BLEU 25.4
  • d_ff = 4096: BLEU 26.2 (Table 3)
  • You can think of the feed-forward network as a “filtering module” that refines each token’s meaning to better fit its context.
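A minimal sketch of the feed-forward block; the weights are random placeholders, but the dimensions (512 -> 2048 -> 512) and the ReLU follow the description above:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)

x = np.random.randn(10, d_model)             # 10 tokens
out = position_wise_ffn(x, W1, b1, W2, b2)   # same shape as the input: (10, 512)
```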

Positional Encoding

Since the Transformer has neither recurrence nor convolution, it needs a way to inject information about token positions. The paper introduces positional encodings, which are added to the input embeddings at the bottom of the encoder and decoder stacks. Positional encodings have dimension d_model so they can be added directly to the embeddings.

Implementation

  • Uses sine and cosine functions:
    • even dimensions
    PE_(pos, 2i) = sin(pos / 10000^(2i/d_model))
    • odd dimensions
    PE_(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
  • Here, pos is the absolute position in the sentence, and i indexes the pair of embedding dimensions.
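A minimal NumPy sketch that follows the two formulas above directly; max_len and the commented usage line are my own illustrative assumptions:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix: sin on even dims, cos on odd dims."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000, 2 * i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=100, d_model=512)
# embeddings = token_embeddings + pe[:seq_len]   (added at the bottom of both stacks)
```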

Even if the formula looks complex, the idea is simple: it assigns a unique, structured “pattern” to each position—essentially giving every token a mathematical coordinate.

Why this design?
The authors hypothesize that for any fixed offset k, PE_(pos+k) can be represented as a linear function of PE_pos, making it easier for the model to learn to attend by relative positions. It may also help the model extrapolate to sequences longer than those seen during training.
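Concretely, writing ω_i = 1 / 10000^(2i/d_model) for the frequency of dimension pair i, the angle-addition identities show that the (sin, cos) pair at position pos + k is a fixed rotation, depending only on k, of the pair at pos:

```latex
\begin{aligned}
\sin\!\big(\omega_i (pos + k)\big) &= \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
\cos\!\big(\omega_i (pos + k)\big) &= \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k) \\[4pt]
\begin{pmatrix} PE_{pos+k,\,2i} \\ PE_{pos+k,\,2i+1} \end{pmatrix}
&=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{pos,\,2i} \\ PE_{pos,\,2i+1} \end{pmatrix}
\end{aligned}
```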

Positional encoding is like a navigator: it gives each token a coordinate so the model understands not just meaning, but ordering. It is like knowing not only the content of a book but also what page you are on—because the same sentence can mean something completely different at the beginning versus the end.
Since the Transformer has no recurrence, it does not inherently know word order. So it assigns position information—like a page number—using sine/cosine patterns. This allows the model to learn order-based meaning, such as “who modifies whom” and how a sentence unfolds, like pinning GPS coordinates onto words.

Training

Data

The paper trains the model on two datasets:

  • WMT 2014 English-German with ~4.5M sentence pairs
  • WMT 2014 English-French, a much larger dataset with ~36M sentence pairs

For EN-DE, they use a shared source-target vocabulary of about 37,000 tokens and apply byte-pair encoding (BPE).
For EN-FR, they use a 32,000 word-piece vocabulary.

Batching and Schedule

Sentence pairs were batched together by approximate sequence length. Each batch contained about 25,000 source tokens and 25,000 target tokens.

Training used 8 NVIDIA P100 GPUs:

  • Base model: 100,000 steps (~12 hours), ~0.4 seconds/step
  • Big model: 300,000 steps (~3.5 days), ~1.0 second/step

Optimization and Regularization

  • Optimizer: Adam with β_1 = 0.9, β_2 = 0.98, ε = 10^(-9)
  • Learning rate schedule: increases linearly for the first warmup_steps = 4000 steps, then decays proportionally to the inverse square root of the step number (see the sketch below)
  • Regularization: residual dropout (Base: P_drop = 0.1) and label smoothing (ε_ls = 0.1)
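The paper's schedule is lrate = d_model^(-0.5) · min(step^(-0.5), step · warmup_steps^(-1.5)); a minimal sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup for the first warmup_steps steps, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps:
print(transformer_lr(4000))   # roughly 0.0007 for d_model = 512
```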

Results

The Transformer achieved strong performance in machine translation and English constituency parsing.

  1. Machine Translation
| Model | EN-DE BLEU | EN-FR BLEU | Training Cost (FLOPs) | Notes |
| --- | --- | --- | --- | --- |
| Transformer (Base) | 27.3 | 38.1 | 2.2 · 10^18 (EN-DE) | Outperformed prior models with less training cost |
| Transformer (Big) | 28.4 | 41.8 | 2.3 · 10^19 (EN-DE) | New state-of-the-art on both tasks |
  • WMT 2014 EN-DE:
    • Big Transformer reached 28.4 BLEU, over 2.0 BLEU higher than the best previous model (including ensembles).
    • Base model also beat all prior models and ensembles using only a fraction of their training cost.
  • WMT 2014 EN-FR:
    • Big Transformer achieved 41.8 BLEU, setting a new single-model state-of-the-art, again with only a fraction of the training cost reported in prior work.
  2. Model Variations (Ablations)
    On the newstest2013 dev set, the paper evaluates how different components affect performance:
  • Multi-head attention: a single head performs ~0.9 BLEU worse than the best configuration; too many heads also degrade quality.
  • Key size d_k: reducing the key size lowers quality.
  • Scale: larger models perform better; dropout helps prevent overfitting.
  • Positional encoding: learned positional embeddings perform nearly the same as sinusoidal encodings. Sinusoidal was chosen because it may extrapolate better to longer sequences.
  3. English Constituency Parsing
    The Transformer also generalizes well beyond translation.
  • Using only WSJ training set (~40K sentences), it achieved 91.3 F1.
  • In a semi-supervised setup with ~17M sentences, it achieved 92.7 F1, outperforming all previously reported models except Recurrent Neural Network Grammar.
  • Unlike RNN seq2seq models, the Transformer trained only on WSJ still outperformed the Berkeley Parser.

Conclusion

The Transformer is the first sequence transduction model to completely replace recurrent layers with multi-head self-attention.

Key takeaways:

  • Superior performance: new state-of-the-art results on both WMT 2014 EN-DE and EN-FR. In EN-DE, the best model even surpassed all previously reported ensemble models.
  • Efficiency: trains much faster than recurrent- or convolution-based architectures.
  • Parallelization: by removing recurrence and convolution and relying entirely on attention, the Transformer enables far more parallel computation.

Closing

The Transformer was a breakthrough that overcame the limitations of RNNs—enabling both parallelization and long-range dependency learning. The models that followed—GPT, BERT, T5, Gemini, and more—all trace their starting point back to this paper. As the title says, the essence of the architecture is simple: Attention is all you need.

namdarine will keep creating content that makes foundational technologies like this easier to understand—and easier to use.


📌 namdarine’s AI Review is a series that breaks down papers, algorithms, and architectures so anyone can understand the core ideas behind AI.

Let’s build it like it’s already happened.
→ See you in the next review!