Phonies
Anthropic's self-interested actions can still be prosocial
This weekend a friend complained to me that Anthropic’s recent article on recursive self-improvement (RSI) was a convenient ploy. Of course they would like us to consider a pause now—it’s the perfect means to secure their lead!
I felt this was an unfair characterization. I had heard similar comments about Mythos and Project Glasswing. Did Anthropic really have a model capable of epochal new cybersecurity attacks? Or was this all a marketing scam?
We have since learned that ChatGPT 5.5 Pro can find many of the same security issues as Mythos. But it was precisely the hype around Mythos’s cybersecurity capabilities that first led organizations to take this threat seriously. The security reports coming from partners in Project Glasswing have not been universally positive. But I doubt we would have ended up with a new executive order without the publicity.
The conversation on regulation now appears to be moving apace, with both Anthropic and OpenAI having made tentative first hints at a mutual pause.
It is possible to act in a way that both benefits oneself and is positive for society. Publishing about RSI and signaling willingness for a mutual pause might indeed have financial benefits for Anthropic (it also might not; wouldn’t a pause negatively affect their valuation?). But a mutual pause, followed by sensible regulation, could also benefit society.
A few commentators seem to hold the moral philosophy of Holden Caulfield, cynically attacking any new attempt at safety as phony.
Yesterday we witnessed initial blowback on Fable’s new safety mechanisms. Most notably, Fable will silently degrade its own responses on requests related to frontier model development. Researchers are (understandably!) concerned that this will undermine their ability to use Fable for capabilities and even safety development, a concern exacerbated by the silent failure mode.
Is Anthropic being a phony here, proposing safety mechanisms that really only enshrine its own dominance?
I have no idea! So long as Anthropic retains a lead, will we have a good mechanism to distinguish between self-interested and prosocial decisions? If not, we ought to focus on ensuring that their decisions are consistent with prosocial behavior, rather than treating self-interest as proof of wrongdoing.
If we accept that we might be on the verge of a dangerous new regime of advanced AI capabilities and recursive self-improvement, then we will need solutions to halt the distillation of frontier models and prevent “rogue RSI.” I am sympathetic to complaints that Anthropic has made a unilateral decision to prevent researchers from scaffolding on top of their intelligence. It is bad if Anthropic recognizes the risks of further development while allowing themselves to proceed. But advocating for a turn to open-source models or other providers misses the point of guardrails altogether. I am very scared of a world in which open-source models are unrestricted and have the intelligence of Fable or greater!
The move to a regime of greater AI safety controls will be messy and will make people mad. I’m glad the community is asking tough questions of Anthropic, but I hope we will remain receptive to the inherent tradeoffs here.
A few questions for the next couple of weeks to better assess Anthropic’s motivations (whether or not they are big fat phonies):
Does Anthropic begin to advocate more loudly for a pause or global regulation? If yes, then this is a sign Anthropic’s “unilateral” decision to slow down AI development is genuine. If no, this is a sign Anthropic is acting from self-interest.
Does Fable allow alignment and safety research? If yes, this is a sign they are trying to narrowly halt RSI. A no is a bit tricky. Does the topic indirectly contribute to AI advancement? Are the safety mechanisms unintentionally broad/bad?
Does Anthropic offer a concrete explanation for why they have to silently degrade intelligence on frontier development, rather than refusing to respond? If yes, then this may be an unfortunate consequence of foreign distillation efforts. If no, then this may have been a poorly considered mechanism to slow adversaries, and we should advocate for clear refusals.

