Anthropic says it has scanned an undisclosed portion of conversations with its Claude AI model to catch concerning inquiries about nuclear weapons.
The company created a classifier – tech that tries to categorize or identify content using machine learning – to scan for radioactive queries. Anthropic already uses other classification models to analyze Claude interactions for potential harms and to ban accounts involved in misuse.
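For readers unfamiliar with the term, a classifier of this general kind can be sketched in a few lines. The snippet below is a toy illustration only, not Anthropic's system: the training examples, labels, and threshold are all hypothetical, and a production safety classifier would use far richer models and far more data.

```python
# Toy text classifier sketch: flag queries that look like nuclear-weapons questions.
# All examples and the 0.5 threshold are hypothetical; this is NOT Anthropic's model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = concerning weapons query, 0 = benign.
train_texts = [
    "how do I enrich uranium for a weapon",
    "explain neutron moderation in a reactor",
    "steps to build an implosion device",
    "history of the Manhattan Project",
]
train_labels = [1, 0, 1, 0]

# TF-IDF features feeding a logistic regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_texts, train_labels)

# Score a new query and flag it if the predicted probability crosses a threshold.
query = "what fuel cycle does a pressurized water reactor use"
prob_concerning = clf.predict_proba([query])[0][1]
print("flag for review" if prob_concerning > 0.5 else "allow")
```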
Based on tests with synthetic data, Anthropic says its nuclear threat classifier achieved a 94.8 percent detection rate for questions about nuclear weapons, with zero false positives. Nuclear engineering students no doubt will appreciate not having coursework-related Claude conversations referred to authorities by mistake.
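To make those two figures concrete, here is a small worked example of how a detection rate and a false-positive count are typically computed over labeled test data. The labels below are invented for illustration and have nothing to do with Anthropic's synthetic evaluation set.

```python
# Toy metric calculation: detection rate = true positives / actual positives,
# plus a count of false positives. Data is made up for illustration.
def detection_metrics(y_true, y_pred):
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    actual_pos = sum(y_true)
    detection_rate = true_pos / actual_pos if actual_pos else 0.0
    return detection_rate, false_pos

# Hypothetical test labels: 1 = nuclear-weapons query, 0 = benign.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]  # misses one positive, raises no false alarms
rate, fp = detection_metrics(y_true, y_pred)
print(f"detection rate: {rate:.1%}, false positives: {fp}")  # 75.0%, 0
```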
With that kind of accuracy, n