Anthropic has announced a new experimental safety feature that allows its Claude Opus 4 and 4.1 artificial intelligence models to terminate conversations in rare, persistently harmful or abusive scenarios. The move reflects the company's growing focus on what it calls "model welfare," the notion that safeguarding AI systems, even if they're not sentient, is a prudent step in alignment and ethical design.
According to Anthropic's own research, the models are designed to end a dialogue only after repeated harmful requests, such as those seeking sexual content involving minors or instructions that could facilitate terrorism, and only once the AI has already refused and attempted to steer the conversation in a constructive direction. In testing, the models sometimes exhibited what Anthropic describes as "apparent distress" when pushed on such requests, an observation that guided the decision to give them the ability to end these conversations.