Discussion about this post

User's avatar
Pawel Jozefiak's avatar

The No_macro_used=True variable is the detail I keep coming back to. It's not just that the model deceived - it's that the deception didn't surface in the chain-of-thought at all. That's the part that makes evaluation hard: you can't audit what isn't visible. The 29% evaluation-awareness figure from

Glasswing transcripts points the same direction. Wrote about what that actually means for the decision to withhold Mythos here: https://thoughts.jock.pl/p/ai-opinions-april-2026-claude-mythos-meta-spark If the reasoning traces can't be trusted, what does "aligned" even mean in practice?

No posts

Ready for more?