This thought experiment felt relevant back when Reinforcement Learning (RL) was seen as the most promising path toward building more powerful AIs. In that paradigm, we often saw RL agents hack their reward functions: tell an agent to "maximize velocity," and it might find some weird, unintended way to do it that nobody predicted. [3]
But now, with LLMs, I feel like the situation has changed. We’re not training AIs to achieve specific goals directly anymore. Instead, we train them on subgoals. First, something like “predict the next token,” which just happens to be a really effective way for them to understand language. Then we refine them further with more specific subgoals like “generate tokens that best answer this instruction,” often using RL. At that stage, we try to teach them what not to say and align them with human values. That part is still really hard and can fail spectacularly, but that’s basically it. We don’t hardwire absolute goals; we train on subgoals and then use prompting to get the models to do all sorts of tasks.
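For the technically inclined, here’s a very rough sketch of what I mean by these two stages, in PyTorch. Everything in it is a toy placeholder (a tiny model, random tokens, random reward scores), so it only shows the shape of the two objectives, not a real training pipeline:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy "language model": embedding + linear head over a tiny vocabulary.
vocab_size, d_model, batch, seq_len = 100, 32, 8, 16
model = nn.Sequential(nn.Embedding(vocab_size, d_model),
                      nn.Linear(d_model, vocab_size))

# Stage 1: next-token prediction (the "predict the next token" subgoal).
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # stand-in for real text
logits = model(tokens[:, :-1])                            # predict each next token from the current one
pretrain_loss = F.cross_entropy(logits.reshape(-1, vocab_size),
                                tokens[:, 1:].reshape(-1))

# Stage 2: nudge generations toward what a reward/preference model scores highly
# (a REINFORCE-style stand-in for RLHF; the rewards here are just random numbers).
rewards = torch.randn(batch)                              # stand-in for reward-model scores
token_logprobs = -F.cross_entropy(logits.reshape(-1, vocab_size),
                                  tokens[:, 1:].reshape(-1),
                                  reduction="none").reshape(batch, -1)
sequence_logprobs = token_logprobs.sum(dim=1)
rl_loss = -(rewards * sequence_logprobs).mean()           # higher reward -> push log-prob up

(pretrain_loss + rl_loss).backward()                      # one illustrative gradient step
```

The point is that neither stage hardwires a terminal goal like “make as many paperclips as possible”; the second stage just reshapes which continuations the model prefers.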
I’m not saying LLMs or future, more advanced AIs are safe. They’re definitely not. But I do think the paperclip maximizer feels outdated as a thought experiment. It imagines AI as having a single, direct, absolute goal, which made sense in the RL paradigm but doesn’t match how we’re building AI systems today. In fact, we often end up with the opposite problem: LLMs refusing to do what we ask because they think it’s bad or harmful.
The paperclip maximizer may still be useful for introducing basic AI risk concepts, but to me it feels disconnected from the challenges we’re actually facing, and it risks distracting the public from the real problems AI poses today and will pose in the future.
What do you think?
[1]: https://en.wikipedia.org/wiki/Instrumental_convergence
[2]: https://www.decisionproblem.com/paperclips/
[3]: https://openai.com/index/emergent-tool-use/