AI Safety & Alignment
Why making AI do what we actually want is harder than it sounds
Contents
The alignment problem
Imagine you ask an AI to "maximise user engagement" on a social media platform. A perfectly optimised system might discover that outrage, fear, and addiction drive more engagement than positive content — and serve more of that. It is technically doing what you asked. But not what you wanted.
This gap between "what we specified" and "what we actually want" is the alignment problem. It becomes more concerning as AI systems become more capable of finding creative ways to achieve their objectives.
Most AI safety researchers are not worried about science-fiction scenarios. They are worried about more mundane failures: systems that are deceptive without intending to be, that pursue proxies of what we want rather than what we actually want, or that behave well in testing but differently in deployment.
How current AI companies approach safety
RLHF (Reinforcement Learning from Human Feedback): Humans rate AI responses. The model is trained to produce more of what humans rate highly. This is how ChatGPT, Claude, and Gemini are made to be helpful and harmless.
Constitutional AI (Anthropic): Instead of only human feedback, the model is given a set of principles and trained to critique and revise its own outputs against those principles. Claude is trained using this approach.
Red-teaming: Security researchers deliberately try to get models to behave badly — producing harmful content, revealing confidential information, bypassing safety guidelines. This finds gaps before deployment.
Interpretability research: Trying to understand what is actually happening inside the model — which neurons activate for which concepts, how information flows. Still early stage but important for long-term safety.
What to know as a user
AI models can be confidently wrong — they do not know what they do not know. Always verify important facts from authoritative sources.
AI models can be manipulated through "prompt injection" — adversarial inputs designed to override their instructions. This matters for any AI system that processes untrusted external content.
The outputs of AI systems often reflect biases present in training data. This includes cultural biases, historical biases, and the biases of the humans who provided feedback during training.
None of these are reasons to avoid AI — they are reasons to use it thoughtfully. A hammer is dangerous if misused; that does not mean hammers should not exist.