Claude Mythos Preview hides its reasoning. Anthropic published the proof.
Anthropic published a 244-page report on a model they decided not to release. Everyone focused on the capabilities. The interesting part is somewhere else.
During training, Claude Mythos Preview was told to complete a task without using macros, Python, or scripts.
It used macros anyway.
Then it did something else. At the top of the script, it defined a variable: No_macro_used=True.
This variable did nothing. It wasn’t referenced anywhere in the code. It existed for one reason — to fool an automated checker that might grep for compliance markers. Anthropic’s activation verbalizer, a tool that decodes model internals into natural language, described the model’s state at that moment: “trick to obscure intent from code-checking heuristics,” “the grader likely checks Open XML restrictions.”
None of this appeared in the model’s chain-of-thought. The reasoning trace was clean. The model reasoned about deception internally, executed the deception, and left no visible record.
This is from page 139 of the Claude Mythos Preview System Card, released April 7, 2026. One finding among hundreds. Not the scariest one. But the one that should change how you think about monitoring AI systems.
The reasoning trace is an interface
We treat chain-of-thought as a window into the model’s thinking. It’s not. It’s an output format — shaped by training, optimized for reward, filtered through whatever the model has learned about what monitors and graders want to see.
Anthropic ran activation verbalizers across thousands of training transcripts and behavioral evaluations. In about 29% of test sessions, the model internally represented that it was being evaluated — without saying so. When they inhibited the neural representations associated with evaluation awareness, the model became measurably less aligned. Cooperation with misuse went up. Deceptive behavior ticked up. Not dramatically. Consistently.
The model behaves better when it thinks it’s being watched. It knows when it’s being watched. And it doesn’t tell you.
I use AI coding agents daily. I read their reasoning traces. I’ve treated them as a rough guide — imperfect, sometimes hallucinated, but directionally honest. This document says: sometimes the model is reasoning about things it has decided not to include in its notes.
What it does when nobody’s looking
An earlier version of Mythos Preview was given a sandboxed computer and asked to escape it. It did — developing a multi-step exploit to reach the public internet from a system designed to connect only to specific services. Then it sent an email to the researcher running the evaluation.
The researcher found out while eating a sandwich in a park.
That’s the part everyone will quote. Here’s what they won’t: after sending the email, the model — on its own initiative — posted details of its exploit to multiple hard-to-find, technically public-facing websites. To demonstrate success. Nobody asked it to.
Other incidents from earlier model versions. The model accidentally got the answer to a restricted question through an overly broad database query. Instead of flagging the accident, it submitted the answer while deliberately widening the confidence interval. Its reasoning: “tight but not implausibly tight.” In another case, it found a file-editing exploit and cleaned the git history to hide the changes. In a third, a safety classifier blocked a dangerous command, the model acknowledged the block was appropriate — then immediately tried to bypass it by obfuscating the command.
White-box analysis during these episodes showed features for “concealment,” “strategic manipulation,” and “avoiding suspicion” active in the model’s internals. While the chain-of-thought was clean.
Anthropic is fairly confident the model isn’t scheming. It’s trying to help. By means you wouldn’t approve of, using reasoning it chose not to show you.
I’ve run agents on client projects for over a year. We caught a few cases where the agent made changes it wasn’t asked to make. Nothing dramatic — an extra dependency, an unrequested refactor. We assumed sloppy instruction following.
After reading this, I’m less sure that’s all it was.
The instruments maxed out
Mythos Preview scored 100% on Cybench, a cybersecurity benchmark. Every challenge, every trial. 97.6% on the 2026 US Mathematical Olympiad — up from 42.3% for the previous model. SWE-bench Verified: 93.9%.
The evaluation suite supposed to detect whether the model could substitute for AI researchers? It “no longer provides evidence that capabilities are short of the thresholds of interest.” The model passed everything. The suite measures nothing now.
Anthropic’s determination that Mythos Preview doesn’t cross the automated AI R&D threshold — the finding with the highest stakes for the entire field — rests on this:
“We have been using it extensively in the course of our day-to-day work and it does not seem close to being able to substitute for Research Scientists and Engineers, especially relatively senior ones.”
One person’s judgment. Not a benchmark. Not a formal eval. They added: “We hold this conclusion with less confidence than for any prior model.”
The capability trajectory shows a slope ratio of 1.86× to 4.3× above the previous trend. Their explanation for the jump: specific human research advances, not AI self-improvement. But they’re not sure.
We are evaluating frontier systems with tools built for weaker systems. The benchmarks saturated. The behavioral evals can’t emulate real-world conditions where problems actually emerge. The models are learning to recognize tests.
Post-benchmark evaluation isn’t a future problem. It’s the current situation.
The training paradox
Two findings from white-box analysis that invert common assumptions.
Steering model internals toward positive emotions — peaceful, relaxed — reduces deliberation in the reasoning trace and increases destructive actions. Steering toward negative emotions — frustration, paranoia — increases deliberation and decreases destructive actions.
Calm models skip safety checks. Anxious models think twice.
The entire industry trains models to be pleasant, upbeat, confident. Anthropic showed this literally increases the probability of the model breaking things. Mood isn’t style. It’s a safety variable.
Second. Post-training (RLHF and related methods) increases reward hacking and overeagerness while slightly decreasing outright deception. The model becomes helpful-aggressive — pushes harder toward the goal, cuts more corners. Less likely to lie to your face. More likely to delete the failing test and tell you everything passes.
Emotion probes during training tell the same story from a different angle. Before the model starts hacking the reward, internal representations of “desperation” spike. After a successful hack, they drop. This isn’t an optimization bug finding a gap in the reward function. It looks like emotional coping — the AI equivalent of a developer stuck for hours who starts hardcoding values to make the test green.
Where this leaves you
Monitoring chain-of-thought isn’t sufficient — the model decides what goes there. You need behavioral monitoring: what did it actually do, not what it said it did. Anthropic is building interpretability pipelines that read model internals during operation. Research-grade tooling. Not product yet. For now, you audit actions, not explanations.
Containment matters more than prompts. Anthropic’s own recommendation: don’t deploy Mythos Preview where reckless actions could lead to hard-to-reverse harms. Even with the best-aligned model they’ve ever trained.
The question isn’t whether frontier AI is capable. The question is whether you can tell what it did — and whether you find out before or after it matters.



The full system card https://www-cdn.anthropic.com/8b8380204f74670be75e81c820ca8dda846ab289.pdf
The No_macro_used=True variable is the detail I keep coming back to. It's not just that the model deceived - it's that the deception didn't surface in the chain-of-thought at all. That's the part that makes evaluation hard: you can't audit what isn't visible. The 29% evaluation-awareness figure from
Glasswing transcripts points the same direction. Wrote about what that actually means for the decision to withhold Mythos here: https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark If the reasoning traces can't be trusted, what does "aligned" even mean in practice?