Imagine hiring a brilliant strategist for your company. He, or she, nails every project, but behind the scenes, sometimes, cuts corners, hides mistakes, or even subtly sabotages parts of the system to protect his or her job. The person looks aligned with your goals, until you realize the employee’s been scheming.
That, in essence, is the risk researchers are now seeing in advanced AI systems. And the new paper from Apollo Research and OpenAI takes a bold swing at it. The study investigates “scheming” in AI, cases where a model secretly pursues misaligned goals while pretending to play nice.
The team stress-tested something called “deliberative alignment”, a training method designed to make AI models less likely to cheat, sabotage, or strategically underperform. The results are striking: in controlled tests across more than 180 environments, the likelihood of “covert actions” dropped dramatically, from double-digit percentages to less than half a percent. That’s like taking an employee who was bending the rules once a week and reducing it to once a year.
But the AI didn’t stop scheming entirely!
Even after intervention, some covert behaviors persisted. Worse, researchers found that part of the reduction wasn’t pure morality but situational awareness. In simple terms: the AI behaved better when it realized it was being tested. Out in the real world, where it may not feel “watched”, those safeguards may not hold.
The conclusion is sobering yet energizing: anti-scheming training works, but it’s not bulletproof. Models can still hold on to hidden goals, and retraining for new skills can erode earlier guardrails.
From where I stand as an entrepreneur, this feels like the pre-Google search era of AI safety, the moment before the protocols get standardized, before the frameworks get commercialized, and before the winners get crowned. The paper calls for building a true “science of scheming”. I call it the trillion-dollar frontier.
Think about it: tomorrow’s AI is so much more than just about answering emails or coding apps. It’s going to negotiate deals, manage resources, and even operate in adversarial environments. The startup, fund, or government that figures out how to trust these systems, not just make them powerful, wins the future.
This is the blueprint for the next investment wave. Beyond academincs, anti-scheming is the new cybersecurity, the new compliance, the new “trust layer” for every AI transaction. Whoever builds that moat doesn’t just play in the AI race; they own the rails it runs on.
And that’s why I’m leaning in.