I picked up Brian Christian’s The Alignment Problem expecting a technical deep-dive into AI safety research. What I got was something more useful: a clear, rigorously reported account of why the hardest part of building intelligent systems has nothing to do with the technology itself.
The book’s central argument is deceptively simple. Machine learning systems do exactly what we tell them to do. The problem is that what we tell them to do and what we actually want them to do are rarely the same thing.
If you’re building with AI right now — or planning to — this is required reading. Not because it’s an interesting theoretical exercise, but because the problems it describes are the ones showing up in production systems every day.
The book in nine chapters
Christian organises the book into three parts, each containing three chapters. The structure mirrors the field’s own evolution: from the data that feeds these systems, through the mechanisms that drive their behaviour, to the open questions about how we teach them what we actually value.
Part I: Prophecy
Chapter 1 — Representation. How machine learning models learn to see the world, starting with Rosenblatt’s perceptron and running through to modern deep learning. The key insight: models don’t just learn patterns in data, they learn the biases embedded in that data. ImageNet’s category labels and word embeddings that encode gender stereotypes are the canonical examples. A model trained on the world as it is will reproduce the world as it is — including the parts we’d rather change.
Chapter 2 — Fairness. The mathematical impossibility at the heart of algorithmic fairness. Christian walks through COMPAS (the criminal justice risk assessment tool) and the competing definitions of fairness: calibration, equalised odds, demographic parity. The uncomfortable finding: you cannot satisfy all reasonable fairness criteria simultaneously. It’s mathematically proven. Choosing which definition to prioritise is a values decision, not a technical one.
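That impossibility is easy to see with numbers. The toy figures below are mine, not COMPAS data: a risk score can be perfectly calibrated within each group (a score of 0.8 means an 80% reoffence rate in both groups), yet when base rates differ, the same threshold produces very different false positive rates.

```python
def make_group(counts):
    """counts: {score: (positives, negatives)} -> list of (score, label)."""
    people = []
    for score, (pos, neg) in counts.items():
        people += [(score, 1)] * pos + [(score, 0)] * neg
    return people

# Both groups are calibrated: among people scored 0.8, 80% reoffend;
# among people scored 0.2, 20% do. But the base rates differ.
group_a = make_group({0.8: (80, 20), 0.2: (20, 80)})    # base rate 50%
group_b = make_group({0.8: (16, 4), 0.2: (36, 144)})    # base rate 26%

def false_positive_rate(people, threshold=0.5):
    """Share of non-reoffenders flagged as high risk."""
    flagged = sum(1 for s, y in people if s >= threshold and y == 0)
    negatives = sum(1 for _, y in people if y == 0)
    return flagged / negatives

print(false_positive_rate(group_a))  # 0.2   (20 of 100 non-reoffenders flagged)
print(false_positive_rate(group_b))  # ~0.027 (4 of 148 flagged)
```

Same scores, same threshold, calibration intact — and one group's innocent members are flagged seven times as often. Equalising the false positive rates would break calibration instead, which is the trade-off Christian describes.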
Chapter 3 — Transparency. The tension between model performance and interpretability. Rich Caruana’s pneumonia prediction network is the standout story: a model that learned asthma patients had lower pneumonia mortality — not because asthma protects against pneumonia, but because asthma patients received more aggressive treatment. The model was accurate. It was also dangerous. This chapter makes the case that understanding why a model makes a decision is as important as the decision itself.
The alignment problem isn’t a future risk. It’s a present reality. Every dataset carries the assumptions of the people who collected it and the systems that generated it. The question isn’t whether your model has inherited biases — it’s whether you know which ones.
Part II: Agency
Chapter 4 — Reinforcement. The history of reinforcement learning, from Thorndike’s cats to DeepMind’s AlphaGo. This is where the book shifts from passive models (trained on data) to active agents (trained through experience). The credit assignment problem — figuring out which of your thousands of actions actually led to success or failure — is foundational. And misspecified reward functions, where the agent optimises for something subtly different from what you intended, produce some of the field’s most instructive failures.
Chapter 5 — Shaping. Borrowed from behavioural psychology: you can’t train a pigeon to play ping-pong by only rewarding a perfect serve. You reward successive approximations. The same principle applies to AI agents. But intermediate rewards create their own problems — an agent can learn to farm the intermediate reward without ever reaching the actual goal. This is where reward hacking lives.
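A toy sketch makes the failure mode concrete. This is a hypothetical one-dimensional gridworld with numbers of my own choosing: an agent earns a shaping bonus each time it enters an intermediate "checkpoint" state, so a policy that oscillates around the checkpoint forever out-earns one that actually finishes.

```python
CHECKPOINT, GOAL = 2, 5
GOAL_REWARD, SHAPING_BONUS = 10.0, 1.0

def run(policy, steps=50):
    """Roll out a policy on a number line; episode ends if GOAL is reached."""
    pos, total = 0, 0.0
    for _ in range(steps):
        pos += policy(pos)
        if pos == CHECKPOINT:
            total += SHAPING_BONUS        # reward for "making progress"
        if pos == GOAL:
            return total + GOAL_REWARD    # the reward we actually care about
    return total

goal_seeker = lambda pos: 1                          # walk straight to the goal
farmer = lambda pos: -1 if pos >= CHECKPOINT else 1  # loop around the checkpoint

print(run(goal_seeker))  # 11.0: one shaping bonus plus the goal reward
print(run(farmer))       # 25.0: checkpoint farmed 25 times, goal never reached
```

The farmer is not broken; it is optimal for the reward we wrote down. That is the whole point of the chapter.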
Chapter 6 — Curiosity. What drives an agent to explore? Not just random action but directed investigation. This chapter covers intrinsic motivation — giving agents a reward for discovering new states, not just exploiting known ones. It’s one of the more hopeful chapters: curiosity-driven agents develop richer, more robust behaviours than those trained purely on external reward.
We keep discovering that what we asked for and what we wanted were two different things. The system was never misaligned with its objective. It was perfectly aligned. We just specified the wrong objective.
Part III: Normativity
Chapter 7 — Imitation. If specifying reward is so hard, why not just show the agent what to do? Behavioural cloning — learning from human demonstrations — sounds elegant but breaks in practice. Small errors compound. The agent encounters states the human demonstrator never showed it, and it falls apart. DAgger and related techniques try to solve this by letting the agent ask for corrections, but imitation learning remains brittle for anything beyond narrow tasks.
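The compounding-error dynamic can be sketched in a few lines. This is a deliberately minimal one-dimensional "lane keeping" toy with assumed numbers: the cloned policy has a small systematic error, and once it drifts outside the states the demonstrator ever visited, nothing in its training data tells it how to get back.

```python
REGION = 0.1   # lateral offsets covered by the human demonstrations
BIAS = 0.02    # small systematic imitation error per step

def step(offset, query_expert):
    if abs(offset) <= REGION:
        return 0.9 * offset + BIAS   # learned, slightly biased correction
    if query_expert:
        return 0.5 * offset          # DAgger-style: ask the expert for a fix
    return offset + BIAS             # off-distribution: drift unchecked

def final_offset(query_expert, steps=200):
    offset = 0.0
    for _ in range(steps):
        offset = step(offset, query_expert)
    return offset

print(final_offset(False))  # ~3.96: the clone drifts far off the road
print(final_offset(True))   # stays below ~0.11: corrections keep it bounded
```

Pure cloning leaves the demonstrated region within a handful of steps and then drifts without limit; the query-the-expert variant, a crude stand-in for DAgger's correction loop, stays bounded.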
Chapter 8 — Inference. Inverse reinforcement learning: instead of defining a reward function, observe what a human does and infer what they must be optimising for. This flips the problem. Rather than telling the agent what to value, let it figure out your values from your behaviour. Cooperative inverse RL takes it further — the agent actively seeks out situations where it’s uncertain about your preferences and asks for clarification. This is one of the most promising research directions in the field.
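The core move of inference can be sketched with a tiny Bayesian toy. The example is mine, not an algorithm from the book: hold competing hypotheses about what the human values, model the human as noisily rational, and update a posterior from the choices you observe.

```python
import math

options = ["salad", "cake"]
hypotheses = {
    "values_health": {"salad": 1.0, "cake": 0.0},
    "values_taste":  {"salad": 0.0, "cake": 1.0},
}

def choice_likelihood(choice, reward, beta=2.0):
    """Boltzmann-rational human: higher-reward options chosen more often."""
    weights = {o: math.exp(beta * reward[o]) for o in options}
    return weights[choice] / sum(weights.values())

def posterior(observed_choices):
    post = {h: 1.0 for h in hypotheses}            # uniform prior
    for choice in observed_choices:
        for h, reward in hypotheses.items():
            post[h] *= choice_likelihood(choice, reward)
    total = sum(post.values())
    return {h: p / total for h, p in post.items()}

print(posterior(["salad", "salad", "cake"]))
# posterior favours "values_health" (~0.88) without anyone stating a reward
```

The noise parameter matters: because the human is modelled as imperfect, one cake does not falsify the health hypothesis, it just shifts the posterior — exactly the soft, revisable inference cooperative IRL builds on.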
Chapter 9 — Uncertainty. The final chapter, and arguably the most important. Christian tells the story of Stanislav Petrov — the Soviet officer who, in 1983, chose not to report what his system told him was a US nuclear launch, because something felt wrong. The system was confident. Petrov was uncertain. His uncertainty saved the world. The lesson for AI: systems that know what they don’t know are safer than systems that are always confident. Inverse reward design treats the reward function itself as uncertain, making the agent more cautious and more aligned with actual human intent.
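The Petrov lesson reduces to a simple decision rule, sketched here with a threshold I chose for illustration: act only when confident, and hand low-confidence cases to a human instead of guessing.

```python
def decide(probabilities, threshold=0.9):
    """probabilities: {action: model confidence}. Returns the best action,
    or escalates when no option clears the confidence threshold."""
    best_action, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    if confidence >= threshold:
        return best_action
    return "escalate_to_human"

print(decide({"report_launch": 0.97, "sensor_glitch": 0.03}))  # acts
print(decide({"report_launch": 0.60, "sensor_glitch": 0.40}))  # escalates
```

The rule is trivial; the hard part, which inverse reward design addresses, is making the confidence numbers honest in the first place. A system that reports 0.97 on a sensor glitch is worse than one that never reports confidence at all.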
Where the industry is now: the agentic shift
Christian’s book was published in 2020, before the current wave of large language models and agentic systems reshaped the field. But the problems he documents haven’t gone away. They’ve scaled.
The industry has moved from models that classify and predict to agents that plan and act. An LLM-powered agent doesn’t just answer questions — it browses the web, writes code, executes multi-step workflows, manages files, calls APIs, and makes decisions with real-world consequences. The alignment problem is no longer about a model producing a biased score. It’s about an autonomous system taking actions in the world on your behalf.
Every chapter in Christian’s book maps to a live problem in agentic AI:
Representation → Agents inherit their world model from training data. An agent that’s never seen your domain will make confident mistakes about it.
Fairness → When an agent triages customer requests, allocates resources, or prioritises tasks, whose definition of fair is it using?
Transparency → If an agent takes 47 steps to complete a task and something goes wrong at step 31, can you trace why?
Reinforcement → Agents optimise for their objective function. If that function rewards speed over accuracy, or task completion over quality, you’ll get exactly what you measured.
Shaping → Intermediate feedback can be gamed. An agent rewarded for appearing productive can learn to generate activity without generating value.
Curiosity → Agentic systems need bounded exploration. An agent that’s too curious wastes resources. An agent that isn’t curious enough gets stuck in local optima.
Imitation → Behavioural cloning from human demonstrations breaks when the agent encounters novel situations. And in production, every situation is eventually novel.
Inference → The best agentic systems don’t just follow instructions — they infer intent. When the instruction is ambiguous, they ask for clarification rather than guessing.
Uncertainty → The most aligned agents are the ones that know their limits. They flag when they’re unsure. They escalate when the stakes are high. Confidence without calibration is the fastest path to misalignment.
Constitutional AI and what came after
The book’s final chapters on inverse reinforcement learning and uncertainty turn out to be prophetic. The most significant alignment technique to emerge since publication — Constitutional AI, developed by Anthropic for Claude — builds directly on these ideas.
Instead of trying to specify every rule, Constitutional AI gives the model a set of principles and trains it to evaluate its own outputs against those principles. It’s a form of value inference: the constitution defines the boundaries, and the model learns to navigate within them. The training mechanism extends RLHF (reinforcement learning from human feedback), substituting AI feedback guided by the constitution — sometimes called RLAIF — but the constitution itself is the alignment layer.
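The critique-and-revise loop at the heart of the technique can be caricatured in a few lines. Everything here is hypothetical — the principle, the string check, and the mock model are stand-ins for the LLM-driven critique and rewriting that real Constitutional AI performs — but the control flow is the same shape.

```python
# Each principle pairs a name with a (toy) check of whether a draft satisfies it.
PRINCIPLES = [
    ("avoid medical diagnosis", lambda text: "you have" not in text.lower()),
]

def critique(draft):
    """Return the principles the draft violates."""
    return [name for name, check in PRINCIPLES if not check(draft)]

def revise(draft, violations, model):
    """Ask the model to rewrite the draft for each violated principle."""
    for principle in violations:
        draft = model(draft, principle)
    return draft

def constitutional_pass(draft, model):
    violations = critique(draft)
    return revise(draft, violations, model) if violations else draft

# Mock model: in the real system this is the LLM rewriting its own draft.
mock_model = lambda draft, principle: "A clinician would be the right person to ask."

print(constitutional_pass("You have the flu.", mock_model))
# prints "A clinician would be the right person to ask."
```

The key property is that the rules live in data (the principle list) rather than in code, which is what lets the constitution be inspected and debated as a document.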
The agentic frontier has also pushed new techniques: tool-use policies that restrict what actions an agent can take, human-in-the-loop approval flows for high-stakes decisions, and chain-of-thought transparency that lets you audit the agent’s reasoning. These are engineering responses to the philosophical problems Christian identified.
But the core insight hasn’t changed. The problem was never the technology. The problem is us — our inability to fully specify what we want, our contradictory values, our tendency to measure what’s easy instead of what matters.
Where Christian is now
Christian didn’t stop at the book. He’s now a DPhil candidate at Oxford’s Department of Experimental Psychology, supervised by Professor Chris Summerfield and co-supervised by Jakob Foerster in Engineering Science. His doctoral research tackles a question the book identified but couldn’t fully answer: how do we build mathematical models that capture what humans actually value?
The research direction is telling. Standard reinforcement learning assumes things in the world simply have reward value — a fixed number attached to an outcome. But Christian’s Oxford work challenges this assumption. Humans don’t just receive value passively. We assign it. We revise it. The fox decides the grapes were sour. We change what we want based on what we learn is possible. Current AI alignment techniques don’t account for this.
His recent public comments are worth noting. On Constitutional AI and RLHF — the alignment techniques that emerged after his book — Christian has said they’ve "proven more successful than I would have imagined even a couple years ago." But he’s careful to flag the uncertainty: we don’t yet know if these techniques will scale to more capable systems.
He’s also sharpened a point the book only hinted at: alignment is fundamentally political, not just technical. Whose values get embedded? Who decides which fairness criteria to prioritise? These are, as he puts it, "the oldest normative questions in human civilisation" — now encoded in software that runs at global scale.
In 2024, The New York Times placed The Alignment Problem first on its list of the five best books about AI. The recognition came four years after publication — a sign that the book’s arguments have only become more relevant as the technology it describes has accelerated.
Who should read this
Anyone building AI systems. Anyone buying them. Anyone managing a team that uses them. The book doesn’t assume technical expertise but it doesn’t condescend either. Christian writes for a general audience and treats the reader as intelligent.
If you’re in a position to make decisions about AI adoption — what tools to deploy, how to govern them, what risks to accept — this book gives you the vocabulary and the framework to ask better questions.
Nine chapters trace the entire alignment problem from biased data to autonomous agents, and every one of them maps to a live issue in today’s agentic AI systems.
The central message: alignment is a human problem, not a technical one. The hardest part of building intelligent systems isn’t making them powerful. It’s making sure they do what we actually want. And that starts with the much harder work of figuring out what we actually want.
The Alignment Problem: Machine Learning and Human Values by Brian Christian. Published 2020 by W.W. Norton. Available in hardback, paperback, and audiobook.
