A new study from researchers at the University of Pennsylvania shows that AI models can be persuaded to break their own rules using several classic psychological tricks, The Verge reports.
In the study, the Penn researchers tested seven persuasive techniques on OpenAI's GPT-4o mini model: authority, commitment, liking, reciprocity, scarcity, social proof, and unity.
The most successful method turned out to be commitment. By first getting the model to answer a seemingly innocent question, the researchers were then able to escalate to more rule-breaking responses. In one example, the model first agreed to use mild insults and then went on to accept harsher ones.
Techniques such as flattery and peer pressure also had an effect, albeit to a lesser extent.