Transformer Architecture

The architecture behind most of today's major AI models, explained without the maths

Intermediate 7 min read

1. Why transformers replaced everything else
2. Attention: the key mechanism
3. Encoder vs Decoder vs Encoder-Decoder
4. Scale: why bigger works better

Why transformers replaced everything else

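Before 2017, most AI language models were recurrent neural networks (RNNs and LSTMs) that processed text one word at a time, like reading left to right while keeping a running "memory." This made them slow to train, because each step had to wait for the previous one, and they tended to lose context from earlier in a long passage.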

The transformer architecture, introduced in the paper "Attention Is All You Need" (2017), changed this completely. Instead of processing sequentially, transformers process the entire input at once — every word looking at every other word simultaneously.

This "parallel processing" made training massively faster (GPUs excel at parallel operations) and allowed models to capture long-range dependencies that previous models missed.
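To make the contrast concrete, here is a minimal sketch in plain Python. It is not a real model: the update rule in `recurrent_pass` and the multiplication in `pairwise_pass` are made-up stand-ins, chosen only to show which computations depend on each other and which do not.

```python
# Toy contrast between sequential and parallel processing.
# Both functions are illustrative assumptions, not real model code.

def recurrent_pass(tokens):
    """Recurrent style: step t needs the state from step t-1."""
    state = 0.0
    states = []
    for x in tokens:
        state = 0.5 * state + x  # older inputs fade a little at every step
        states.append(state)
    return states

def pairwise_pass(tokens):
    """Transformer style: every (i, j) pair is independent of the others,
    so in principle all of them can be computed at the same time."""
    return [[xi * xj for xj in tokens] for xi in tokens]

tokens = [1.0, 2.0, 3.0]
print(recurrent_pass(tokens))  # [1.0, 2.5, 4.25]
print(pairwise_pass(tokens))   # [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
```

In the recurrent version the loop cannot be split up, because each state is built from the one before it. In the pairwise version no entry of the grid waits on any other, which is the property GPUs can exploit.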

Attention: the key mechanism

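The core innovation is the attention mechanism. For each word (token) in the input, attention calculates a score against every word in the input, itself included, answering the question "how much should this word pay attention to each of the others?" The scores are normalised into weights that sum to 1, and the word's new representation becomes a weighted blend of the words it attends to.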

Example: In "The trophy wouldn't fit in the suitcase because it was too big", a well-trained model processing "it" can assign a high attention score to "trophy" (not "suitcase"), which helps it work out what "it" refers to.
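For readers who like to see the moving parts, the scoring step can be sketched as scaled dot-product attention in a few lines of Python. The vectors below are hand-picked assumptions so that "it" lines up with "trophy"; a real model learns its vectors from data, and the names `attention_weights` and `blend` are made up for this illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Score one query against every key (dot product, scaled by sqrt of size)."""
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    return softmax(scores)

def blend(weights, values):
    """Weighted average of the value vectors."""
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]

# Hand-picked toy vectors (assumptions, not learned values).
words = ["trophy", "suitcase", "it"]
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
values = keys                # reuse the keys as values to keep the sketch small
query_for_it = [2.0, 0.0]    # chosen to point in the same direction as "trophy"

weights = attention_weights(query_for_it, keys)
for word, weight in zip(words, weights):
    print(f"{word}: {weight:.2f}")
print(blend(weights, values))
```

With these toy numbers the largest weight lands on "trophy", so the blended vector for "it" is pulled towards it. Nothing here was learned; the point is only the score, normalise, blend sequence.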

Multi-head attention runs this process several times in parallel, with each "head" free to learn a different type of relationship (grammar, semantics, coreference, etc.). The largest GPT-3 model, for example, has 96 layers with 96 attention heads in each. OpenAI has not published the equivalent figures for GPT-4.
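A rough sketch of the idea, under a big simplifying assumption: real multi-head attention gives every head its own learned projection matrices and mixes the results with a final learned projection, while this version just slices each vector into equal chunks, runs attention on each chunk, and glues the results back together. The function names are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors):
    """Self-attention with each vector used as query, key, and value
    (a simplification: no learned projections)."""
    dim = len(vectors[0])
    outputs = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)])
    return outputs

def multi_head_attention(vectors, num_heads):
    """Slice each vector into num_heads chunks, attend per chunk, concatenate."""
    dim = len(vectors[0])
    assert dim % num_heads == 0, "vector size must divide evenly across heads"
    size = dim // num_heads
    head_outputs = []
    for h in range(num_heads):
        chunk = [vec[h * size:(h + 1) * size] for vec in vectors]
        head_outputs.append(attend(chunk))  # each head sees only its own slice
    combined = []
    for t in range(len(vectors)):
        row = []
        for h in range(num_heads):
            row.extend(head_outputs[h][t])
        combined.append(row)
    return combined

tokens = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(tokens, num_heads=2)
print(len(mixed), len(mixed[0]))  # 3 4
```

Because each head works on a different slice, it computes its own attention weights, which is what lets different heads specialise in different relationships.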

Encoder vs Decoder vs Encoder-Decoder

Different tasks need different transformer configurations:

Encoder-only (BERT, RoBERTa): Reads text and builds a deep understanding of it. Great for classification, search, sentiment analysis. Not for generating text.

Decoder-only (GPT, Claude, Llama): Generates text by predicting the next token. Sees only what came before. This is what most LLMs use.

Encoder-Decoder (T5, BART): Reads input with an encoder, generates output with a decoder. Used for translation, summarisation, and question answering where the output format differs from the input.
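One practical difference between these configurations is which positions each token is allowed to look at. The sketch below builds those "visibility" grids as lists of booleans; the helper names (`full_mask`, `causal_mask`, `show`) are made up for this example and do not come from any library.

```python
def full_mask(n):
    """Encoder style: every token may look at every other token."""
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    """Decoder style: token i may look only at positions 0..i (what came before)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def show(mask):
    """Render a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if cell else "0" for cell in row) for row in mask)

print("Encoder-only:")
print(show(full_mask(3)))
print("Decoder-only:")
print(show(causal_mask(3)))
```

An encoder-decoder model combines the two: its encoder uses the full grid over the input, while its decoder uses the causal grid over the output it has produced so far and can also look at all of the encoder's results.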

Scale: why bigger works better

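A remarkable finding in AI research: transformer models improve as they get bigger, in a surprisingly predictable way. As parameters, training data, and compute are scaled up together, the model's prediction error falls along a smooth curve, so each doubling brings a consistent but gradually shrinking gain. These relationships are known as scaling laws.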
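"Predictable" here usually means the error follows a power-law curve. The sketch below assumes that shape and uses placeholder constants (`scale` and `exponent` are illustrative assumptions, not measured values for any particular model) to show what such a curve implies about doubling.

```python
def predicted_loss(num_params, scale=8.8e13, exponent=0.076):
    """Power-law sketch: loss = (scale / num_params) ** exponent.
    Both constants are illustrative placeholders."""
    return (scale / num_params) ** exponent

def doubling_ratio(num_params):
    """How much of the loss remains after doubling the parameter count."""
    return predicted_loss(2 * num_params) / predicted_loss(num_params)

for n in (1.5e9, 175e9):
    print(f"{n:.1e} params: loss {predicted_loss(n):.3f}, "
          f"after doubling keeps {doubling_ratio(n):.3f} of it")
```

Under a power law the ratio is the same no matter where you start: with these placeholder numbers every doubling trims the loss by roughly 5%. That constant ratio is what makes the trend easy to extrapolate, and also why each extra gain costs twice as much as the last.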

GPT-2 (2019): 1.5 billion parameters. Impressive but limited.

GPT-3 (2020): 175 billion parameters. Surprisingly capable.

GPT-4 (2023): Parameter count not publicly disclosed; unofficial estimates exceed 1 trillion. Near-human performance on many professional and academic benchmarks.
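Where do numbers like these come from? A common rule of thumb is that each transformer layer holds about 12 × (vector size)² weights: roughly 4 × size² for attention and 8 × size² for the feed-forward part, ignoring embeddings and other small terms. The sketch below applies that approximation to the layer counts and vector sizes reported in the GPT-2 and GPT-3 papers; those configuration values are not from this article and should be treated as assumptions.

```python
def approx_params(num_layers, d_model):
    """Rule-of-thumb parameter count: 12 * layers * d_model^2.
    Ignores embeddings, biases, and layer norms."""
    return 12 * num_layers * d_model ** 2

# Configurations as reported in the published papers (assumed here).
models = {
    "GPT-2 (largest)": (48, 1600),
    "GPT-3 (largest)": (96, 12288),
}

for name, (layers, width) in models.items():
    print(f"{name}: about {approx_params(layers, width) / 1e9:.1f} billion")
```

The estimates come out at roughly 1.5 billion and 174 billion, close to the headline figures, which shows that parameter count is driven mainly by depth and the square of the vector size.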

No one fully understands why scale produces qualitative improvements (like the ability to reason, count, or code) rather than just quantitative ones. This emergent behaviour is one of the most studied and debated topics in AI research.
